| 00:06:55 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 00:07:14 | | AmAnd0A joins |
| 00:08:30 | | AlsoHP_Archivist quits [Client Quit] |
| 00:08:49 | | HP_Archivist (HP_Archivist) joins |
| 00:14:13 | | AmAnd0A quits [Ping timeout: 258 seconds] |
| 00:14:22 | | AmAnd0A joins |
| 00:16:38 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
| 00:18:08 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 00:18:25 | | AmAnd0A joins |
| 00:30:07 | <@JAA> | Everything accessible on the Knowledge Adventure CDN and present as of my initial listing on 2023-06-14 or the relisting about 6 hours ago should now be archived. |
| 00:30:12 | <@JAA> | betamax, nicolas17: ^ |
| 00:43:03 | | lk quits [Ping timeout: 265 seconds] |
| 00:54:30 | | dumbgoy joins |
| 00:55:43 | | icedice quits [Client Quit] |
| 00:56:32 | | dumbgoy quits [Client Quit] |
| 00:57:23 | | dumbgoy joins |
| 01:03:29 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 01:03:49 | | AmAnd0A joins |
| 01:55:08 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 01:55:25 | | AmAnd0A joins |
| 02:22:08 | | killsushi quits [Ping timeout: 265 seconds] |
| 03:19:10 | | dumbgoy quits [Ping timeout: 265 seconds] |
| 03:51:38 | <h2ibot> | FireonLive edited Current Projects (-10, move Tiki to recently finished): https://wiki.archiveteam.org/?diff=50050&oldid=50046 |
| 04:22:28 | | TastyWiener95 quits [Quit: So long, farewell, auf wiedersehen, good night] |
| 04:30:02 | | HP_Archivist quits [Read error: Connection reset by peer] |
| 04:30:44 | | TastyWiener95 (TastyWiener95) joins |
| 04:56:17 | | IDK quits [Client Quit] |
| 05:09:55 | | hitgrr8 joins |
| 05:26:56 | <fireonlive> | Visa to Acquire Pismo for US$ 1 billion in cash: https://www.pismo.io/blog/visa-to-acquire-pismo/ |
| 05:29:04 | <fireonlive> | "Pismo will retain our founders and current management team. The transaction is subject to regulatory approvals and other customary closing conditions and is expected to close by the end of 2023.", website probably not super in danger i guess |
| 05:40:50 | | AmAnd0A quits [Remote host closed the connection] |
| 05:41:03 | | AmAnd0A joins |
| 05:41:16 | | BlueMaxima quits [Client Quit] |
| 06:35:08 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 06:47:26 | | bf_ joins |
| 07:01:20 | | flashfire42|m joins |
| 07:55:04 | | Arcorann (Arcorann) joins |
| 08:05:13 | | IDK (IDK) joins |
| 08:11:05 | <flashfire42|m> | Is there any way to monitor the offload of the targets? I think someone was saying a few were getting full or close to it |
| 08:18:42 | | Doomahol1 quits [Read error: Connection reset by peer] |
| 08:19:32 | | Doomahol1 joins |
| 08:34:11 | | second (second) joins |
| 08:37:56 | | sec^nd quits [Ping timeout: 245 seconds] |
| 08:37:56 | | second is now known as sec^nd |
| 08:46:34 | <imer> | flashfire42|m: nope, ideally targets run at near-full anyways to apply backpressure - if they were empty that just means IA can accept more data and we're archiving too slow ;) |
| 08:47:48 | <flashfire42|m> | Heh I mean yeah but there are some projects currently paused because we were grabbing too much data for IA to keep up |
| 08:49:30 | <imer> | yeah. not quite sure what the status there is. someone else would have to chime in what is going to happen there, if anything |
| 08:50:05 | <imer> | could be a matter of waiting it out until things slow down naturally or there might be improvements on the IA/AT side so things can go faster |
| 08:50:48 | <imer> | a lot of data though, so none of it is easy, I imagine |
| 09:19:27 | <masterx244|m> | IA is a common bottleneck; the S3 upload "loading bays" are the bottleneck pretty often. AT can pull data out faster than it can be ingested there |
| 10:00:01 | | railen63 quits [Remote host closed the connection] |
| 10:00:17 | | railen63 joins |
| 10:01:47 | | SF quits [Ping timeout: 265 seconds] |
| 10:03:57 | | sec^nd quits [Remote host closed the connection] |
| 10:05:20 | | sec^nd (second) joins |
| 10:14:12 | | SF joins |
| 10:50:36 | <betamax> | JAA: that's amazing, thanks so much! |
| 10:51:20 | <betamax> | Would you be able to share your relisting from a day or so ago? My friend is working with others to reverse engineer the server for the game and having the full file listing would be very helpful |
| 10:54:57 | | Chris5010 quits [Ping timeout: 265 seconds] |
| 12:07:31 | | jacksonchen666 quits [Ping timeout: 245 seconds] |
| 12:07:57 | | jacksonchen666 (jacksonchen666) joins |
| 12:10:57 | | justmolamola joins |
| 12:14:04 | | justmolamola quits [Client Quit] |
| 12:20:12 | | sonick quits [Client Quit] |
| 12:23:16 | | justmolamola joins |
| 12:31:59 | | justmolamola quits [Client Quit] |
| 12:32:58 | | justmolamola joins |
| 12:35:16 | | W7RFa6AbNFz quits [Read error: Connection reset by peer] |
| 12:35:39 | | W7RFa6AbNFz joins |
| 12:48:25 | <h2ibot> | OrIdow6 edited Egloos (+649, Account of the grab): https://wiki.archiveteam.org/?diff=50051&oldid=50043 |
| 12:49:41 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 12:50:17 | | AmAnd0A joins |
| 12:57:09 | <@OrIdow6> | No reply from Wysp.ws |
| 13:05:06 | | nulldata quits [Ping timeout: 258 seconds] |
| 13:09:23 | | Chris5010 (Chris5010) joins |
| 13:14:28 | | froschgrosch joins |
| 13:15:21 | <Hans5958> | Are there archives of the leaderboards for past projects? |
| 13:21:06 | | justmolamola quits [Client Quit] |
| 13:26:30 | <Chris5010> | If you know the project name, you can use that in the normal tracker URL: https://tracker.archiveteam.org/[projectName]/. For example, the project for Enjin is done, but the leaderboard is still accessible: https://tracker.archiveteam.org/enjin/ |
| 13:46:10 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 13:49:40 | | Dango360_ (Dango360) joins |
| 13:53:29 | | Dango360 quits [Ping timeout: 252 seconds] |
| 13:54:31 | | froschgrosch quits [Remote host closed the connection] |
| 13:56:20 | | Megame quits [Client Quit] |
| 14:05:04 | | MrRadar_ (MrRadar) joins |
| 14:06:08 | | MrRadar quits [Ping timeout: 252 seconds] |
| 14:31:20 | | HP_Archivist (HP_Archivist) joins |
| 14:32:47 | <h2ibot> | Yts98 edited LINE BLOG (+139, Add link to data): https://wiki.archiveteam.org/?diff=50052&oldid=49955 |
| 15:01:50 | <Hans5958> | Where is the repo to (at least the front end of) tracker.archiveteam.org? |
| 15:03:41 | | BigBrain_ (bigbrain) joins |
| 15:04:36 | | BigBrain quits [Ping timeout: 245 seconds] |
| 15:09:33 | | Dango360_ quits [Client Quit] |
| 15:09:37 | <pokechu22> | https://github.com/ArchiveTeam/universal-tracker I think? |
| 15:09:43 | | Dango360 (Dango360) joins |
| 15:11:35 | | Arcorann quits [Ping timeout: 252 seconds] |
| 15:11:38 | | dumbgoy joins |
| 15:13:43 | <Hans5958> | Really? I'd probably want to contribute some code, but it looks "dead" |
| 15:16:56 | <h2ibot> | Manu edited Deathwatch (+261, Stitcher will shut down end of August): https://wiki.archiveteam.org/?diff=50053&oldid=50047 |
| 15:17:00 | | dumbgoy_ joins |
| 15:17:56 | <h2ibot> | Noxian edited Tumblr (+0, /* See also */ latest version of TumblThree): https://wiki.archiveteam.org/?diff=50054&oldid=49141 |
| 15:17:57 | <h2ibot> | Hans5958 edited Egloos (-12, Little bit of rewording): https://wiki.archiveteam.org/?diff=50055&oldid=50051 |
| 15:17:58 | <h2ibot> | Exorcism edited Tiki (+23): https://wiki.archiveteam.org/?diff=50056&oldid=50049 |
| 15:17:59 | <h2ibot> | Exorcism uploaded File:Tiki logo.png: https://wiki.archiveteam.org/?title=File%3ATiki%20logo.png |
| 15:18:57 | <h2ibot> | Exorcism edited Deathwatch (+0): https://wiki.archiveteam.org/?diff=50058&oldid=50053 |
| 15:19:36 | | sec^nd quits [Ping timeout: 245 seconds] |
| 15:20:23 | | dumbgoy quits [Ping timeout: 252 seconds] |
| 15:22:02 | | Hackerpcs quits [Ping timeout: 252 seconds] |
| 15:22:11 | | froschgrosch joins |
| 15:23:10 | | froschgrosch quits [Remote host closed the connection] |
| 15:23:39 | | Hackerpcs (Hackerpcs) joins |
| 15:41:40 | <@arkiver> | egloos, tiki, and lineblog project are done! |
| 15:42:15 | <@arkiver> | tracker front page is becoming less busy :P |
| 15:43:15 | <yts98> | arkiver: great! now I want to propose a warrior project for Xuite :p https://github.com/yts98/xuite-grab |
| 15:44:13 | <fireonlive> | i read that as xtube which is both incorrect and also long gone (and already done) :c |
| 15:44:24 | <threedeeitguy> | tiki was fun. my first top 10 finish :D |
| 15:47:26 | <fireonlive> | haha yeah first where i was near the top :p |
| 15:49:03 | <h2ibot> | Yts98 edited Current Projects (+0, Move LINE BLOG to recently finished): https://wiki.archiveteam.org/?diff=50059&oldid=50050 |
| 15:50:19 | <rktk> | Just wanted to throw this out as a forum to archive: https://memoriesoffear.jcink.net |
| 15:50:40 | <rktk> | They did a number of translated games, one namely Toilet in Wonderland (which Vinny Vinesauce played on stream) |
| 15:50:42 | <fireonlive> | Hans5958: looks like that's the one yeah |
| 15:50:45 | <rktk> | https://memoriesoffear.jcink.net/index.php?showtopic=56 |
| 15:53:04 | <h2ibot> | Yts98 edited LINE BLOG (+1, Finish the project): https://wiki.archiveteam.org/?diff=50060&oldid=50052 |
| 15:53:09 | <fireonlive> | i imagine everyone is quite busy with a lot of other things (including things outside of archiveteam) so it's not as high priority as other stuff |
| 15:54:02 | <fireonlive> | yts98: :D |
| 15:54:17 | <rktk> | fireonlive, do you mean that forum I linked sorry, or replying to someone else |
| 15:54:32 | <fireonlive> | rktk: oh sorry, replying to Hans5958 |
| 15:54:32 | <rktk> | If there is a recommended way of scraping a forum like that, I have no issue doing it myself |
| 15:54:40 | <rktk> | ah ok fireonlive :) |
| 15:54:42 | <fireonlive> | :) |
| 15:54:58 | <fireonlive> | regarding the https://tracker.archiveteam.org codebase |
| 16:09:45 | <pokechu22> | rktk: Probably archivebot, but it's fairly full currently. That one should be pretty easy to run though since it's small |
| 16:10:04 | <rktk> | pokechu22, could I run an archivebot myself locally? |
| 16:10:08 | <rktk> | or should I just do a wget mirror |
| 16:10:24 | <@arkiver> | yts98: why JSObj? |
| 16:10:47 | <pokechu22> | ArchiveBot isn't designed to be run locally, https://github.com/ArchiveTeam/grab-site is the more usable equivalent |
| 16:10:56 | <pokechu22> | There's also a forum-dl project or something like that that might be usable |
| 16:11:23 | <yts98> | arkiver: to deal with JS objects embedded in the HTML. |
| 16:11:32 | <pokechu22> | wget's also fine, but wouldn't end up on web.archive.org (though anything a random person does probably wouldn't end up there) |
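For a one-off personal grab like the one rktk describes, a WARC-writing wget invocation might look like the following. This is a sketch, not an endorsed recipe: the output filename is a placeholder, and the exact flags should be checked against the local wget version.

```shell
# Mirror a small forum into a WARC, politely.
# --warc-file writes a .warc.gz alongside the normal mirror tree,
# --warc-cdx emits an index, and --wait rate-limits the crawl.
wget --mirror --page-requisites --no-parent \
     --wait=1 --random-wait \
     --warc-file=memoriesoffear --warc-cdx \
     "https://memoriesoffear.jcink.net/"
```

As pokechu22 notes, the resulting WARC would still need to be uploaded somewhere (e.g. as an archive.org item) rather than landing in the Wayback Machine automatically.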
| 16:11:51 | <pokechu22> | Looks like they also have mediafire links so those will need to be put into #mediaonfire |
| 16:12:25 | <yts98> | I found simply replacing single quotes with double quotes may still cause errors |
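yts98's point can be demonstrated: naively swapping quote characters breaks as soon as a string contains an apostrophe, while Python's `ast.literal_eval` can parse many simple JS-style object literals directly. A sketch with made-up example data; real embedded objects may need a proper JS parser.

```python
import ast
import json

# A JS-style object literal as it might appear embedded in HTML
# (hypothetical example data).
js_obj = """{'title': "it's sweet", 'views': 3}"""

# Naive fix: replace single quotes with double quotes.
naive = js_obj.replace("'", '"')
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    # The apostrophe in "it's" now terminates the string early.
    naive_ok = False

# Safer for simple literals: evaluate as a Python literal instead.
parsed = ast.literal_eval(js_obj)
```

`ast.literal_eval` only accepts literal syntax, so it fails safely on anything with function calls or other executable JS.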
| 16:13:18 | <@arkiver> | yts98: on the item types, can you please make them a bit more descriptive? |
| 16:14:00 | <pokechu22> | Looks like there's actually a lot of forums under jcink.net, so that's something to check later |
| 16:15:12 | <@arkiver> | yts98: looks pretty good! |
| 16:17:07 | <rktk> | pokechu22, yeah this is just a random personal grab. and i could save to warc, mainly just as a means of throwing it on archive as an object, rather than web archive |
| 16:17:19 | <rktk> | pokechu22, yeah definitely something worth looking at |
| 16:18:39 | <yts98> | I chose very short item type names because the wiki said "Because the Tracker uses Redis as its database, memory usage is a concern." |
| 16:18:42 | <@arkiver> | let's make a channel for xuite! i'm not sure if this word has a meaning, perhaps we can have a play on words in the language of this word |
| 16:18:57 | <@arkiver> | yts98: ah. well lists are mostly offloaded, so not a huge concern now |
| 16:19:35 | <yts98> | arkiver: watch this video. |
| 16:19:35 | <yts98> | https://vlog.xuite.net/play/Qm9leW9BLTEzODg4Ni5mbHY= |
| 16:19:43 | <threedeeitguy> | There's small website that I wish to regularly save a few pages for (usually 1-2 pages a day). The prompt to save the page would be an email notification from said site. I already have extracting the link sorted. Is there an API equivalent of https://web.archive.org/save ? Saving the page is fairly time critical as once items are sold the page is |
| 16:19:43 | <threedeeitguy> | updated and information is removed. |
| 16:20:03 | <pokechu22> | rktk: I've started an archivebot job anyways, shouldn't take too long |
| 16:20:06 | <yts98> | Xuite's slogan is "My Xuite, So Sweet~" |
| 16:20:11 | <rktk> | hurray! pokechu22 |
| 16:20:15 | <@arkiver> | yts98: i see some stuff there like TODOs on handling malformed JSON responses |
| 16:20:33 | <rktk> | someone should save digitalfaq before all the scam evidence is wiped away |
| 16:20:34 | <pokechu22> | threedeeitguy: Pretty sure web.archive.org/save can be treated as an API endpoint, I remember seeing some docs on that, one sec |
| 16:20:41 | <pokechu22> | digitalfaq? |
| 16:21:19 | <@arkiver> | those malformed responses should be caught in write_to_warc, then not be written to WARC, and either be marked for retrying to retrieve, or the item should be aborted. or in rare cases no write to WARC and let it continue as usual if this is an 'error' that is fine |
| 16:21:19 | <yts98> | arkiver: their API sometimes mixes cp950 with utf8 |
| 16:21:26 | <pokechu22> | https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit |
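The linked document describes Save Page Now 2 (SPN2), which accepts authenticated POSTs to https://web.archive.org/save. A sketch of assembling such a request with the standard library: `build_spn2_request` is a hypothetical helper, the keys are placeholders, and field names should be double-checked against the SPN2 docs.

```python
import urllib.parse

SPN2_ENDPOINT = "https://web.archive.org/save"

def build_spn2_request(page_url, access_key, secret_key,
                       capture_outlinks=False):
    """Assemble endpoint, headers, and form body for an SPN2 capture.

    SPN2 authenticates with an S3-style 'LOW <key>:<secret>' header
    (keys from archive.org account settings) and returns JSON when
    'Accept: application/json' is sent.
    """
    headers = {
        "Accept": "application/json",
        "Authorization": f"LOW {access_key}:{secret_key}",
    }
    fields = {"url": page_url}
    if capture_outlinks:
        fields["capture_outlinks"] = "1"
    return SPN2_ENDPOINT, headers, urllib.parse.urlencode(fields)
```

Actually sending it is then one `urllib.request.Request` POST away, which would suit the time-critical email-triggered captures threedeeitguy describes.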
| 16:21:46 | <@arkiver> | right, i see. so the error is on our side, not on theirs? |
| 16:21:52 | <@arkiver> | yts98: ^ |
| 16:21:52 | <rktk> | pokechu22, digitalfaq.com |
| 16:22:22 | <pokechu22> | What's the deal with scam evidence? |
| 16:22:51 | <pokechu22> | Looks like it was previously saved August 2022: https://archive.fart.website/archivebot/viewer/job/4ialw |
| 16:23:00 | | sec^nd (second) joins |
| 16:23:05 | <pokechu22> | err, no, those are small enough that saving it probably failed |
| 16:23:38 | <yts98> | arkiver: yes. the error is caused in JSON.lua. |
| 16:25:13 | <threedeeitguy> | pokechu22 thanks, il take a look. It may not be suitable anyway. I just tried a page and its far from clean: https://web.archive.org/web/20230629161553/https://www.stationroadsteam.com/3-12-inch-gauge-union-pacific-big-boy-4-8-8-4-stock-code-11379/# |
| 16:25:57 | <@arkiver> | yts98: i see there is still a chance of 'bad data' getting into the WARC, for example I see a check on json["ok"] in get_urls. at this point the data is already in the WARC, which it shouldn't be if there is an indication of an error |
| 16:26:07 | <@JAA> | betamax: Yeah, everything will be on IA once the upload finishes. |
| 16:26:29 | <@arkiver> | so this json["ok"] check should be in write_to_warc, and then again either retried or items aborted (or accepted in rare cases) if the error is there |
| 16:26:43 | <@arkiver> | there may be other checks in get_urls that should move to write_to_warc |
| 16:26:44 | | Matthww1 quits [Ping timeout: 258 seconds] |
| 16:27:36 | | Matthww1 joins |
| 16:29:10 | <yts98> | arkiver: json["ok"] being false is not rare. It happens when an article is password-protected, or a user did not activate one of the blog, album, or vlog services. |
| 16:29:24 | <@arkiver> | alright good |
| 16:30:15 | <yts98> | and then I saw thousands of usernames discovered, but the API will respond with "no such user". |
| 16:32:10 | <yts98> | their username search API even returns illegal usernames, possibly manually altered by the moderator to deactivate some accounts |
| 16:33:10 | <@arkiver> | interesting |
| 16:33:11 | <@arkiver> | so |
| 16:33:12 | <@arkiver> | on images |
| 16:33:20 | <@arkiver> | photo.xuite.net, and such |
| 16:33:52 | <@arkiver> | can different items get to the same images? can they be duplicated between items? i see they are now generally always accepted for immediate archiving |
| 16:35:26 | | BigBrain_ quits [Ping timeout: 245 seconds] |
| 16:36:46 | <yts98> | I've seen some image URLs in API responses of user items, but some of these images belong to an album, so the current script will grab them twice or more. |
| 16:37:27 | | BigBrain_ (bigbrain) joins |
| 16:37:32 | <@arkiver> | are the URLs for a single image unique? |
| 16:38:11 | <@arkiver> | as in, is it always 3.example.com/image.png, or can there also be 2.example.com/image.png, 3.example.com/image?format=png, etc.? |
| 16:40:16 | <@arkiver> | I see the TODO about false positives. yes, this may produce false positives. but archiving is usually done with the thought of "better discovery too much than too little". so if we are sure everything will be discovered with very strict rules, then that is fine |
| 16:40:52 | <yts98> | for photo.xuite.net, the image URLs are unique; |
| 16:40:52 | <yts98> | when images are embedded in blog articles, the service possibly generates another URL that accepts outlinks |
| 16:41:04 | <@arkiver> | but it is often good to keep the rules somewhat relaxed, allow for a possibility of false positives. eliminate these false positives if we find them. and that way perhaps extract/archive more than we initially were under the impression was actually there |
| 16:41:31 | | lk (lk) joins |
| 16:41:36 | <@arkiver> | yts98: "another URL that accepts outlinks" - for an image? what do you mean? |
| 16:44:03 | | BigBrain_ quits [Read error: Connection reset by peer] |
| 16:44:29 | <@arkiver> | yts98: on the video URLs and load balancing. can video URLs to the same video be found in different items? as in, can there be duplicates? (same as what i asked for the photos) |
| 16:45:17 | <@arkiver> | if a certain video will _only_ be discovered from a single item, then good! and then let's get whatever load balancers they use; Wget-AT will prevent writing duplicate data, while still preserving the URLs. |
| 16:46:06 | <@arkiver> | there will only be duplicate data downloaded on the side of the Warrior, but this extra data will be deduplicated away when written to the WARC. if xuite can handle it, then it's good to get this duplicate data. |
| 16:46:40 | <@arkiver> | because this is not only about purely data preservation, but also about URL preservation. we want to try and cover the entire range of possible URLs, so that those can be found through the Wayback Machine. |
| 16:47:54 | <@arkiver> | so. let's say we have 1.example.com/image.png and 2.example.com/image.png both pointing to the same image. we download them _in the same Wget-AT session_, then they will be deduplicated, while both their URLs are preserved (yes, data will be downloaded twice) |
| 16:48:48 | <@arkiver> | if we have separate items for those two URLs to the same image, then it is likely that those separate items end up in different Wget-AT sessions, and are not deduplicated, which wastes bytes |
| 16:49:27 | <@arkiver> | if we're talking about 1 TB or so of duplicated data, that is not a big problem. but if it turns into 10 TB or 100 TB of duplicated data, that is a problem |
| 16:51:25 | <@arkiver> | yts98: i see you store data in _data.txt, what is the use of this? we're actually not really using data.txt anymore. in the past data.txt was used to discover items, but nowadays we use backfeed for that. |
| 16:51:41 | <@arkiver> | there is nothing on the targets currently that will do anything with the _data.txt file. |
| 16:52:03 | <yts98> | I don't remember in which article I saw image URL formats other than 1.share.photo.xuite.net. |
| 16:52:03 | <yts98> | Separating images to new items is a reasonable approach. Let's handle them like cdn-obs in lineblog. |
| 16:52:03 | <yts98> | Video URLs may also be checked in user items. But they may expire if we backfeed them as item. |
| 16:52:03 | <yts98> | I thought WARC revisit could only be used on the same URL. So WARC revisit applies to different URLs when the response body is identical? |
| 16:52:45 | <@arkiver> | yes, on the response body being identical |
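The deduplication arkiver describes hinges on the WARC payload digest: captures whose bodies hash identically can be written as revisit records pointing at the first capture, regardless of URL. A minimal sketch of the digest itself (SHA-1, base32-encoded, per the WARC specification; the revisit-record bookkeeping around it is Wget-AT's job and is not shown):

```python
import base64
import hashlib

def payload_digest(body: bytes) -> str:
    """WARC-Payload-Digest: SHA-1 of the payload, base32-encoded."""
    return "sha1:" + base64.b32encode(hashlib.sha1(body).digest()).decode()

# 1.example.com/image.png and 2.example.com/image.png serving the same
# bytes share a digest, so within one Wget-AT session the second capture
# can become a revisit record instead of a second full response body.
img = b"\x89PNG..."  # placeholder payload
assert payload_digest(img) == payload_digest(img)
```

This is also why duplicates across *different* sessions waste bytes: each session only knows the digests it has seen itself.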
| 16:53:36 | <@arkiver> | i see on expiring video URLs. are the video URLs you get through a user item actually used for playback? or are they "just there" in some data blob, while actually only the video URL on the post page is used for playback? |
| 16:54:43 | <@arkiver> | on FlashVars rules - those are not known yet? |
| 16:55:40 | | joe joins |
| 16:56:17 | | IDK quits [Client Quit] |
| 16:57:51 | <@arkiver> | yts98: well overall looks pretty good, i'll be further checking this later! |
| 16:59:13 | <yts98> | the purpose of data.txt is to inspect the metadata not included in item names, including blog_id and every <embed>. |
| 16:59:13 | <yts98> | I've discovered 5 types of FlashVars rules https://wiki.archiveteam.org/index.php/Xuite#Flash-based_creations , but I'm not sure if I missed more. |
| 17:00:22 | <yts98> | arkiver: thanks for taking a look! I learned very much about archiving practices :) |
| 17:00:37 | <@arkiver> | good to hear :) |
| 17:00:54 | <@arkiver> | alright i'm not sure yet about data.txt, will be having a better look later! |
| 17:01:37 | <@arkiver> | (i only actually looked at the code - not the site yet) |
| 17:07:38 | <yts98> | a possible alternative to data.txt is to create a dummy backfeed that does not actually backfeed the items into the project. |
| 17:08:19 | <@arkiver> | that sounds better yes |
| 17:08:36 | <@arkiver> | but i'm not sure if we actually need it, need to do some experiments as well |
| 17:09:02 | | joe quits [Remote host closed the connection] |
| 17:09:04 | <@arkiver> | if there is something unexpected, can item be simply aborted? |
| 17:09:55 | <@arkiver> | i see for example that when an a: item is queued, it is always written to the data.txt as well, that is not needed i think? |
| 17:20:14 | | killsushi joins |
| 17:22:49 | <fireonlive> | gettyimages acquired unsplash earlier in 2021: https://unsplash.com/blog/unsplash-getty/ and looks like they’re jumping on the “oh fuck AI is going to ruin us” bandwagon way too late https://twitter.com/sindresorhus/status/1674390882399801345 |
| 17:23:12 | <fireonlive> | not sure what he means by “removed their free non-API endpoint” though |
| 17:26:25 | <@arkiver> | yts98: i see very explicit extraction of certain URLs, also from the HTML, line 1096 for example. i think this is already handled by the 'general' URLs extraction happening at line 1966? if not, that might be a better place |
| 17:27:07 | <@JAA> | Next AT project: archive everything that has a free API. |
| 17:27:20 | <@arkiver> | this is again coming from the point of "better extract too much than too little" - if we only allow extraction of very specific URLs in very specific places, there is a great risk of missing something. |
| 17:27:51 | <@arkiver> | hmm |
| 17:28:12 | <@arkiver> | or, is this being extracted specifically here to have the certain referer be different than the current URLs we're working on? |
| 17:28:41 | <@arkiver> | in which case it would be good. later it'd be picked up in the 'general' extraction code, but not queued since it was queued before |
| 17:28:52 | <@arkiver> | current URL* |
| 18:17:31 | <fireonlive> | JAA: yeeeeah :| |
| 18:17:52 | <fireonlive> | 🙃 🔫 |
| 18:18:23 | <fireonlive> | they said AI/ML would destroy the internet |
| 18:18:36 | <fireonlive> | i just didn't think it would be in this way |
| 18:28:21 | | sec^nd quits [Ping timeout: 245 seconds] |
| 18:29:05 | | sec^nd (second) joins |
| 18:36:39 | | sec^nd quits [Remote host closed the connection] |
| 18:37:01 | | sec^nd (second) joins |
| 18:47:23 | | nicolas17 joins |
| 18:49:41 | | spirit quits [Client Quit] |
| 18:58:51 | | nulldata joins |
| 19:38:17 | | spirit joins |
| 19:42:16 | | bf_ quits [Ping timeout: 265 seconds] |
| 20:23:57 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
| 20:30:41 | <pokechu22> | tinaja.com looks kinda big so I'm not going to put it into archivebot until we have a little bit more space |
| 20:30:59 | <@arkiver> | let's see |
| 20:31:11 | <@arkiver> | interesting site |
| 20:31:38 | <that_lurker> | seems to have a lot of PDFs so might be big |
| 20:32:23 | <@arkiver> | pokechu22: shall we put it in archivebot anyway? |
| 20:32:32 | <vokunal|m> | I was about to ask what you look for to determine whether it looks big or not. At first glance I figured it looks like it's from the 90s, so small |
| 20:33:05 | <pokechu22> | Currently all the AB pipelines are full because hel3/hel4 are low on disk space because of the general upload backlog to my understanding |
| 20:34:17 | <pokechu22> | Probably we could still queue it though |
| 20:36:59 | | thenes quits [Remote host closed the connection] |
| 20:37:20 | | thenes (thenes) joins |
| 20:39:16 | <that_lurker> | actually those PDFs are not that big, so it might be something like 50-60 gigs at most |
| 20:40:25 | <that_lurker> | could be good to queue it as you can just pause it in the event that there is no space right? |
| 20:41:21 | <pokechu22> | Alright, queued it |
| 20:41:47 | <pokechu22> | It'll auto-pause when there's no space (< 5 GB I think) |
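The auto-pause behaviour pokechu22 mentions can be sketched as a simple free-space check. The 5 GB threshold is the figure from the chat ("I think"), and `should_pause` is a hypothetical helper, not ArchiveBot's actual code:

```python
import shutil

PAUSE_THRESHOLD_BYTES = 5 * 1024**3  # "< 5 GB I think"

def should_pause(path: str = ".") -> bool:
    """True when the pipeline's disk is below the free-space threshold."""
    return shutil.disk_usage(path).free < PAUSE_THRESHOLD_BYTES
```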
| 20:42:46 | <that_lurker> | LUL was already started apparently :P |
| 20:43:14 | | eroc1990 (eroc1990) joins |
| 20:48:21 | | sec^nd quits [Ping timeout: 245 seconds] |
| 20:51:24 | | sec^nd (second) joins |
| 21:00:39 | <@arkiver> | pokechu22: general upload backlog to where? |
| 21:00:47 | <@arkiver> | is IA the bottle neck? |
| 21:01:24 | <pokechu22> | I think so? |
| 21:01:31 | <pokechu22> | JAA talked more about it I think |
| 21:01:52 | <pokechu22> | main thing is that if you look at http://archivebot.com/pipelines most machines are full |
| 21:01:57 | <@arkiver> | we need an "ArchiveBot talk" channel |
| 21:02:20 | <@JAA> | arkiver: #down-the-tube and AB used the same rsync target. The former clogged it. |
| 21:02:31 | <@arkiver> | ah |
| 21:02:42 | <@arkiver> | JAA: how about that archivebot talk channel? |
| 21:02:44 | <@JAA> | That comes up every few months or so. It'd be mostly a dead channel probably. |
| 21:03:09 | <@arkiver> | i usually miss messages someone posts to me in #archivebot |
| 21:03:10 | <@arkiver> | oh well |
| 21:03:25 | <@arkiver> | warning to all ^ if I need to really notice the message, don't write to me in #archivebot |
| 21:03:33 | <@JAA> | Make your client log highlights into a separate window. :-) |
| 21:03:55 | <pokechu22> | Relevant messages are at 03:47:37 UTC on June 29 |
| 21:04:23 | <vokunal|m> | This is what I've been using to check. Is this known as a good way to see if they're clogged? https://monitor.archive.org/weathermap/weathermap.html |
| 21:04:57 | <pokechu22> | I don't think the rsync targets would be on there as they're archiveteam infrastructure, but I'm not 100% sure of that |
| 21:05:14 | <vokunal|m> | the switchtc0-200paul has been in the red for around 30+ hours |
| 21:05:17 | <that_lurker> | JAA That is the one thing from znc I would like to have on thelounge |
| 21:05:26 | <@arkiver> | JAA: that would be something i need to figure out and not doing that now |
| 21:05:30 | <fireonlive> | did someone say that archive.org had an issue with (or intentionally?) limited inbound speed? |
| 21:05:38 | <@arkiver> | vokunal|m: no, there can be many reasons |
| 21:05:41 | <fireonlive> | that was oof a while ago though |
| 21:08:07 | <pokechu22> | Oh, it was also mentioned that https://yarus.ru/ was shutting down shortly per https://yarus.ru/post/1989728469 - there's an AB job for it, but there's basically no chance it'll finish completely :| |
| 21:09:52 | | hitgrr8 quits [Client Quit] |
| 21:10:55 | <pokechu22> | ugh, it looks like that site's also JS-based so AB's not going to get anything useful :| (and I think I pushed it too hard and am now getting 403s :|) |
| 21:12:36 | <that_lurker> | no wonder google translate did not work on it :P |
| 21:15:41 | <vokunal|m> | Yeah I was wondering why it wasn't working |
| 21:18:44 | <that_lurker> | Oh and just found out The Lounge has a recent mentions feature |
| 21:19:07 | <that_lurker> | thats convenient |
| 21:19:27 | <fireonlive> | indeed! the @ symbol |
| 21:24:53 | <@arkiver> | pokechu22: checking |
| 21:25:28 | <@arkiver> | pokechu22: are you planning to pull tinaja.com through AB later? |
| 21:25:46 | <pokechu22> | It turns out it was already running in AB since yesterday |
| 21:26:06 | <@arkiver> | oof just seeing yarus in my browser with that loading screen... oof oof |
| 21:26:39 | <@arkiver> | what |
| 21:26:41 | <@arkiver> | June 30? |
| 21:26:49 | <@arkiver> | not again |
| 21:26:50 | <pokechu22> | Several hours ago it was 18 hours |
| 21:26:57 | <pokechu22> | frankly I think it's not possible to get it done |
| 21:27:01 | <pokechu22> | It does have a complete sitemap though |
| 21:27:23 | <@arkiver> | they posted the message you linked today? |
| 21:27:28 | <@arkiver> | for a shutdown tomorrow? |
| 21:28:09 | <pokechu22> | nyuuzyou: ^ |
| 21:28:17 | <pokechu22> | It seems like that's the case though |
| 21:28:26 | <@arkiver> | rewby: are you around? |
| 21:28:36 | <@arkiver> | i'm not sure if we can get a project up in time |
| 21:28:45 | <@arkiver> | but we might need a target for a shutdown tomorrow... announced today :( |
| 21:29:07 | <@rewby|backup> | I'll get you a target if you get a tracker proj and vars in... 30 mins |
| 21:29:14 | <@arkiver> | woah sequential post IDs? |
| 21:29:15 | <@arkiver> | i like it |
| 21:29:16 | <pokechu22> | "У вас будет время сохранить весь свой контент" - "You will have time to save all your content." yeah, sure... |
| 21:29:33 | <pokechu22> | Sequential IDs and a full sitemap as far as I can tell |
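Sequential post IDs make item generation trivial: the ID space can be chunked into fixed-size ranges for the tracker. A sketch; the `post:lo-hi` naming is hypothetical, not the project's actual item format:

```python
def id_range_items(start, end, size=1000, prefix="post:"):
    """Yield tracker-style item names covering [start, end] in chunks
    of `size` sequential IDs each."""
    for lo in range(start, end + 1, size):
        hi = min(lo + size - 1, end)
        yield f"{prefix}{lo}-{hi}"
```

Gaps in the sequence only cost cheap 404s, which fits the usual "better discover too much than too little" approach.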
| 21:29:44 | <pokechu22> | but on the other hand, javascript |
| 21:29:57 | <@JAA> | <dr_evil_air_quotes.gif> |
| 21:30:02 | <@arkiver> | i'm always skeptical about sitemaps |
| 21:30:43 | <@arkiver> | rewby|backup: alright |
| 21:32:54 | <imer> | they seem to have a rate limit (on api. at least), returns a standard nginx 403 |
| 21:32:54 | <imer> | and now that's changed to another 403 page |
| 21:34:20 | <@arkiver> | imer: proper status code? |
| 21:34:27 | <imer> | yep |
| 21:34:30 | <imer> | 403 |
| 21:34:39 | <@arkiver> | good |
| 21:34:51 | <imer> | here's the content of the non-nginx 403: https://transfer.archivete.am/hyaCY/2023-06-29_23-34-40_wmbgyH3GLo.txt |
| 21:34:57 | <imer> | i've censored my ip with XXX |
| 21:35:57 | <pokechu22> | Archivebot is still getting 403s a while after con=6, d=0 (that wasn't using the API and in fact wasn't even trying to retrieve stuff from the API, though) |
| 21:36:08 | <fireonlive> | ok everyone gather around for a picture |
| 21:36:17 | <fireonlive> | an api actually used a proper http status code |
| 21:36:18 | <imer> | block doesn't seem to be shared across domains, but obviously the site won't work |
| 21:36:20 | <fireonlive> | we need to remember this moment |
| 21:37:37 | <@arkiver> | interesting |
| 21:37:42 | <imer> | i'll keep checking if I get unblocked |
| 21:37:43 | <@arkiver> | IDs sequential with a huge sudden gap |
| 21:38:07 | <imer> | response headers: https://transfer.archivete.am/mTei1/2023-06-29_23-37-36_Qp3eqSS4hN.png content-type is proper as well |
| 21:38:35 | <imer> | no ipv6 (why do I even bother checking this) |
| 21:39:17 | <fireonlive> | one day you'll be rewarded |
| 21:39:21 | <fireonlive> | it's like finding a rare coin |
| 21:39:30 | <fireonlive> | the toyota yarus, https://en.wikipedia.org/wiki/Toyota_Yaris |
| 21:39:32 | <fireonlive> | lol |
| 21:40:33 | <imer> | do we have a channel name yet? i'll throw into the hat #norus if not |
| 21:40:52 | <fireonlive> | nop |
| 21:40:55 | <imer> | words i can arrange sentence to |
| 21:40:56 | <fireonlive> | mine was #yaaaaaaaaasus but that's kinda gay |
| 21:40:57 | <fireonlive> | :p |
| 21:41:01 | <@arkiver> | imer: see what i wrote earlier ;) |
| 21:41:02 | <fireonlive> | also not punny enough |
| 21:41:19 | <@arkiver> | #norus it is |
| 21:41:31 | <fireonlive> | arkiver: you were in the tiki channel |
| 21:41:32 | <fireonlive> | :D |
| 21:42:00 | <@arkiver> | HEY EVERYONE! JAA is not in #norus , let's party there. no one tell JAA please!! |
| 21:54:10 | <h2ibot> | JustAnotherArchivist created ЯRUS (+194, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=%D0%AFRUS |
| 21:55:11 | <h2ibot> | JustAnotherArchivist created Yarus.ru (+19, Redirected page to [[ЯRUS]]): https://wiki.archiveteam.org/?title=Yarus.ru |
| 21:57:36 | | killsushi quits [Ping timeout: 265 seconds] |
| 22:04:12 | <h2ibot> | Pcr edited List of websites excluded from the Wayback Machine (+26, Add TH3D): https://wiki.archiveteam.org/?diff=50063&oldid=49985 |
| 22:07:17 | <fireonlive> | :D |
| 22:10:39 | <thuban> | arkiver: wrt noise in #archivebot, if you use weechat, there are some filters at https://wiki.archiveteam.org/index.php/User:Switchnode |
| 22:50:23 | | Unholy236131 (Unholy2361) joins |
| 22:53:35 | | Unholy23613 quits [Ping timeout: 252 seconds] |
| 22:53:35 | | Unholy236131 is now known as Unholy23613 |
| 23:09:18 | | andrew4 (andrew) joins |
| 23:10:38 | | andrew quits [Ping timeout: 252 seconds] |
| 23:10:38 | | andrew4 is now known as andrew |
| 23:52:21 | | Megame (Megame) joins |