| 00:36:32 | | HackMii quits [Ping timeout: 258 seconds] |
| 00:38:26 | | HackMii (hacktheplanet) joins |
| 00:40:30 | | HP_Archivist quits [Read error: Connection reset by peer] |
| 00:40:57 | | HP_Archivist (HP_Archivist) joins |
| 00:41:49 | | HP_Archivist quits [Read error: Connection reset by peer] |
| 00:42:15 | | HP_Archivist (HP_Archivist) joins |
| 00:44:41 | | HP_Archivist quits [Client Quit] |
| 01:01:21 | | Ruthalas8 (Ruthalas) joins |
| 01:03:22 | | Ruthalas quits [Ping timeout: 258 seconds] |
| 01:03:22 | | Ruthalas8 is now known as Ruthalas |
| 01:03:39 | | dm4v_ joins |
| 01:04:16 | | dm4v quits [Ping timeout: 250 seconds] |
| 01:04:16 | | dm4v_ is now known as dm4v |
| 01:04:16 | | dm4v is now authenticated as dm4v |
| 01:04:16 | | dm4v quits [Changing host] |
| 01:04:16 | | dm4v (dm4v) joins |
| 01:06:31 | | Lord_Nightmare quits [Remote host closed the connection] |
| 01:06:43 | | Lord_Nightmare (Lord_Nightmare) joins |
| 01:08:04 | | Ruthalas6 (Ruthalas) joins |
| 01:08:35 | | Ruthalas quits [Ping timeout: 250 seconds] |
| 01:08:36 | | Ruthalas6 is now known as Ruthalas |
| 01:12:16 | | Lord_Nightmare2 (Lord_Nightmare) joins |
| 01:13:10 | | Lord_Nightmare quits [Read error: Connection reset by peer] |
| 01:13:10 | | Lord_Nightmare2 is now known as Lord_Nightmare |
| 01:13:21 | | aleph quits [Ping timeout: 250 seconds] |
| 01:13:39 | | aleph joins |
| 01:17:34 | <pcr> | Wikipedia has a list of HK newspapers, anything with an online version should probably be grabbed if it isn't already. https://en.wikipedia.org/wiki/List_of_newspapers_in_Hong_Kong?wprov=sfla1 |
| 01:20:24 | <@JAA> | That was mentioned a few hours ago. It would be better to first focus on ones that are critical of mainland China since those are the ones at risk. |
| 01:29:53 | <pcr> | I'd agree that those ones are probably more at risk. |
| 01:37:58 | | HackMii quits [Remote host closed the connection] |
| 01:38:25 | | HackMii (hacktheplanet) joins |
| 02:00:12 | | Ruthalas quits [Client Quit] |
| 02:58:40 | | ThreeHM quits [Ping timeout: 250 seconds] |
| 03:00:39 | | ThreeHM (ThreeHeadedMonkey) joins |
| 03:39:19 | | lennier2 joins |
| 03:40:09 | | lennier1 quits [Ping timeout: 258 seconds] |
| 03:40:10 | | lennier2 is now known as lennier1 |
| 03:50:53 | | qw3rty__ joins |
| 03:54:34 | | qw3rty_ quits [Ping timeout: 250 seconds] |
| 04:04:06 | | lennier1 quits [Ping timeout: 250 seconds] |
| 04:04:25 | | lennier1 (lennier1) joins |
| 04:16:11 | | monoxane quits [Ping timeout: 258 seconds] |
| 04:28:29 | | monoxane (monoxane) joins |
| 04:48:09 | | monoxane4 (monoxane) joins |
| 04:50:28 | | monoxane quits [Ping timeout: 250 seconds] |
| 04:50:28 | | monoxane4 is now known as monoxane |
| 05:15:45 | | sonick quits [Quit: Connection closed for inactivity] |
| 05:19:51 | <Barto> | JAA: apple daily is definitely the biggest risk |
| 05:23:52 | | nerdguy1138 (nerdguy1138) joins |
| 05:25:55 | | Ruthalas (Ruthalas) joins |
| 05:27:51 | | nerdguy1138 quits [Client Quit] |
| 05:45:15 | | Doranwen quits [Remote host closed the connection] |
| 05:45:37 | | Doranwen (Doranwen) joins |
| 06:03:08 | | DogsRNice quits [Read error: Connection reset by peer] |
| 06:04:11 | | achivarin quits [Remote host closed the connection] |
| 06:44:32 | | KRG quits [Ping timeout: 258 seconds] |
| 07:33:25 | | Monotoko joins |
| 07:33:44 | <Monotoko> | Hey guys, do you have a paid account for Apple Daily? |
| 07:34:06 | <Monotoko> | if not I can provide one |
| 07:36:29 | <Ryz> | Hello Monotoko, please stick around, other people will ask you stuff regards to Apple Daily |
| 07:36:39 | <Ryz> | I think it's low activity here right now |
| 07:36:59 | <Monotoko> | Sure, I have zero experience archiving but if you're hitting paywalls or anything I'm happy to help |
| 07:37:46 | <Ryz> | JAA or arkiver or anyone involved with dealing with Apple Daily ^ |
| 07:40:36 | <@OrIdow6> | J A A knows all |
| 07:41:12 | <@OrIdow6> | Thank you Monotoko, as far as we have observed, and as I know, we don't need that |
| 07:41:47 | <Ryz> | I'm curious if there's anything that can only be accessed through a paid account~ |
| 07:42:38 | <@OrIdow6> | Possible |
| 07:42:50 | <@OrIdow6> | From what I've heard, both the articles and videos work fine w/o one |
| 07:43:50 | <@OrIdow6> | If there is, it would need special handling |
| 07:44:14 | <@OrIdow6> | Since it would be questionable to get paid content and make it freely available online, even if they went defunct |
| 07:45:04 | <Monotoko> | I don't believe there is, I've looked at some of the stuff I can access and it seems freely accessible - I mostly subscribed to support them back during the protests |
| 07:45:49 | <Monotoko> | at least for my account, there is "Apple Daily |
| 07:56:59 | | Monotoko quits [Remote host closed the connection] |
| 08:09:18 | | Zerote joins |
| 08:12:19 | | Zerote__ quits [Ping timeout: 258 seconds] |
| 08:33:36 | | Atom-- joins |
| 08:34:56 | | Atom quits [Ping timeout: 250 seconds] |
| 09:24:08 | | noteness quits [Write error: Broken pipe] |
| 09:24:08 | | HackMii quits [Write error: Broken pipe] |
| 09:24:29 | | HackMii (hacktheplanet) joins |
| 09:27:03 | | noteness (noteness) joins |
| 09:58:05 | | achivarin (achivarin) joins |
| 10:29:12 | | lorwp (lorwp) joins |
| 10:31:35 | | lorwp quits [Changing host] |
| 10:31:35 | | lorwp (lorwp) joins |
| 10:31:38 | | Matthww quits [Quit: The Lounge - https://thelounge.chat] |
| 10:41:22 | | Matthww8 joins |
| 10:50:24 | | LeighR (LeighR) joins |
| 10:57:41 | | @chfoo quits [Remote host closed the connection] |
| 10:57:59 | | chfoo (chfoo) joins |
| 10:57:59 | | @ChanServ sets mode: +o chfoo |
| 11:00:43 | | LeighR leaves |
| 11:01:24 | | billy549 quits [Ping timeout: 250 seconds] |
| 11:21:26 | | billy549 (Billy549) joins |
| 11:32:35 | <@JAA> | OrIdow6: Your wiki account is automoderated now. |
| 11:33:57 | | Vista2003 joins |
| 11:34:16 | <@EggplantN> | Oridow6 I assume we’re gonna make AppleDaily a warrior project? |
| 11:34:20 | <@EggplantN> | We’ve got discovery done |
| 11:36:16 | <@EggplantN> | Hey Vista2003 so we’ve fed all the urls we’ve found of articles into our #// (urls) project. This project just grabs the page and no media etc. This was a quick effort yesterday to grab all we could while we discuss the best way to get a proper grab. |
| 11:36:42 | <Vista2003> | ok, thanks for the info |
| 11:38:17 | <@arkiver> | EggplantN: why a warrior project? |
| 11:38:30 | <@arkiver> | images? |
| 11:38:56 | <Jake> | (I think we had a AB job for the videos too?) |
| 11:39:03 | <@JAA> | Videos ran through AB last night, yeah. |
| 11:39:13 | <@EggplantN> | Yeah? Well I thought that’s what Oridow6 was up to arkiver |
| 11:39:16 | <@EggplantN> | or am I mistaken |
| 11:39:17 | <@arkiver> | and it actually got the videos as well? |
| 11:39:56 | <@EggplantN> | Either way. It’s probably time for a grab of HK media in general. |
| 11:40:03 | <@arkiver> | i mean, AB actually got the videos as well? |
| 11:40:14 | <@JAA> | Yes, this was the .m3u8 and the .ts segments listed in them. |
| 11:40:20 | <@arkiver> | good |
| 11:40:20 | <@JAA> | jodizzle prepared that list. |
| 11:42:45 | | jtagcat quits [Quit: Bye!] |
| 11:45:49 | | jtagcat (jtagcat) joins |
| 12:03:12 | <@JAA> | There, created a wiki page for it collecting everything that has happened so far: https://wiki.archiveteam.org/index.php/Apple_Daily |
| 12:06:16 | | KRG joins |
| 12:06:16 | | KRG is now authenticated as KRG |
| 12:07:00 | | KRG quits [Remote host closed the connection] |
| 12:13:35 | <rewby> | JAA: Maybe list the list of sitemap urls + urls found in sitemaps that I posted yesterday on the wiki. So we can easily find it for future steps |
| 12:14:13 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 12:15:05 | | KRG joins |
| 12:15:05 | | KRG is now authenticated as KRG |
| 12:20:25 | <@JAA> | rewby: Yup, done. |
| 12:31:37 | <@OrIdow6> | EggplantN arkiver: I wasn't working on that, just talking about it |
| 12:31:41 | <@OrIdow6> | Thank you JAA |
| 12:32:48 | <@OrIdow6> | And thank you for making this wiki page also |
| 12:58:33 | <@OrIdow6> | Anyhow, apparently GREE removes and/or makes inaccessible (I still don't understand it) some material on the 24th, not the 26th, so I need to do that |
| 13:07:15 | | KRG quits [Read error: Connection reset by peer] |
| 13:07:23 | | KRG` joins |
| 13:10:08 | <@arkiver> | OrIdow6: yeah i thought so |
| 13:10:43 | <@JAA> | I'm throwing a bunch of GREE stuff into AB right now. |
| 13:11:05 | <@JAA> | Just as a safety net. Chances are that'll miss a lot of it. |
| 13:12:41 | <@OrIdow6> | Thanks JAA |
| 13:13:13 | <@arkiver> | EggplantN: regarding a conversation we had some time ago, I want to move forward with a project to archive full websites through the warrior without very customized code for each |
| 13:13:30 | <@arkiver> | this could happen on #// or a different project |
| 13:13:58 | <@arkiver> | current plan is: |
| 13:14:16 | <@arkiver> | - for each website, set some config that the warriors all get |
| 13:14:40 | <@arkiver> | - this config has stuff like max depth, max time, etc. for a website |
| 13:14:54 | <@EggplantN> | So like the domains project we had for .eu |
| 13:14:56 | <@JAA> | So like #noteurdomain and #flashbang except better? |
| 13:15:00 | <@EggplantN> | Yeah |
| 13:15:07 | <@arkiver> | - a website is started by putting in an initial URL (for example the main page) |
| 13:15:22 | <@EggplantN> | Okay, I’d like to expand on that. |
| 13:15:37 | <@JAA> | Sounds nice. |
| 13:16:03 | <@arkiver> | - the warrior gets a URL from the project, which includes current depth and time; depending on whether time/depth are over the limit, the warrior will queue all newly found URLs |
| 13:16:21 | <@arkiver> | here's the tricky part that'll need EggplantN |
| 13:17:09 | <@arkiver> | we'll need special queues for websites, so the warrior queues a URL back to the project, this URL goes into a specific queue, and URLs are only released to the tracker (or handed out from the tracker) at a certain maximum rate per second |
| 13:17:38 | <@arkiver> | i believe we previously had rabbitmq queues in mind, but not sure |
| 13:17:41 | <@EggplantN> | Okay. I see. |
| 13:18:15 | <@arkiver> | EggplantN: JAA: yes, sort of like those two |
| 13:18:48 | <@OrIdow6> | So basically a distributed AB with slightly less grab-time supervision? |
| 13:19:07 | <@JAA> | 'slightly' lol |
| 13:19:10 | <@arkiver> | yes, though the config will support excludes, etc. |
| 13:19:26 | <@arkiver> | will just take a minute or so for new configs to be distributed to all warriors |
| 13:20:11 | <@EggplantN> | Okay, yes I like this. So basically a xml/Json/YAML stored in BunnyCDN. Containing URL/URLs, ignore lists, site map links if needed. Then max time, size and amount of links. That can get sent to a “master” of such. That will initiate the grab, get all the details and send the info to the tracker and then using some code in between it will |
| 13:20:11 | <@EggplantN> | validate URLs/limits etc and then feed to specific backfeed queues limited to x/minute releases |
| 13:20:35 | <@EggplantN> | in between of backfeed and the warriors |
| 13:21:41 | <@arkiver> | max size may not work, i was planning that the item a warrior receives is like 'DEPTH;TIME;URL' |
| 13:21:48 | <@EggplantN> | I’ve got a plan. Gimme a few hours I’ll draw it up for you. |
| 13:22:14 | <@JAA> | Time limit would be for that one URL download? |
| 13:22:37 | <@arkiver> | then the warrior adds 1 to the depth, records time (and adds to old time), and queues new items as 'DEPTH+1;TIME+DOWNLOAD_TIME;URL' |
| 13:22:44 | <@JAA> | Or additive over all depths so far? |
| 13:22:48 | <@arkiver> | additive JAA |
| 13:23:06 | <@JAA> | Right, so I guess that would be measured in seconds then. |
| 13:23:12 | <@arkiver> | for example yeah |
| 13:23:23 | <@JAA> | Unless a site is extremely slow, that is. |
| 13:23:48 | <@arkiver> | other method would be to record time somewhere, and add a flag to the config that the warrior should stop archiving the site if the time limit is broken |
| 13:25:02 | | sonick (sonick) joins |
| 13:25:06 | <@arkiver> | if we want to limit total URLs, we'll need some central command server, but for now i was not thinking of implementing that yet, just the queues from EggplantN and release from those queues at certain rate |
| 13:25:21 | <@arkiver> | (and config files, pulled every minute or so) |
| 13:25:29 | <@arkiver> | and then we can expand on that as we go along |
| 13:25:30 | | Doranwen quits [Ping timeout: 258 seconds] |
| 13:26:29 | <@EggplantN> | Yeah I know what you mean. Ive got some ideas |
| 13:26:43 | <@arkiver> | nice! |
| 13:30:11 | <@arkiver> | this project can be used for sites that we need to archive fast, or sites that are somewhat too large for AB, but too small (or too messy) for a custom warrior project |
| 13:31:15 | <@arkiver> | (appledaily is a pretty good example) |
| 13:33:58 | | Doranwen (Doranwen) joins |
| 13:39:27 | <Vista2003> | I've been running the Warrior program for a bit now and I've noticed that uploads keep failing as your servers have reached the maximum amount of connections |
| 13:41:29 | <@OrIdow6> | Vista2003: If this is just within the last few hours, it's known |
| 13:41:48 | <@OrIdow6> | (Though I don't know the details) |
| 13:43:24 | <@EggplantN> | Yep BunnyCDN was offline. I need to restart some things in 30 mins |
| 13:44:39 | <Vista2003> | OrIdow6 Yep, it's in the last few hours |
| 13:59:20 | <@EggplantN> | okay fixing now Vista2003 |
| 14:07:53 | | ragu__ quits [Client Quit] |
| 14:08:25 | | DogsRNice (Webuser299) joins |
| 14:23:38 | <@EggplantN> | So arkiver my plan was a bit more complex but nicer long term. |
| 14:23:38 | <@EggplantN> | With the effective "SpeedAB" I was planning to do a central server to manage it. effectively we would send it a domain/list of URLs similar to AB, parameters like sitemaps links, ignore sets, max size, max time, max URLs (i think you mean depth by this), offsite links y/n |
| 14:23:38 | <@EggplantN> | To do this, once the master gets the initial URL/s it sends them to a specific REDIS queue in either the urls/custom project. This queue has a limit of x/min (requires TP integration) they would likely be added as <ID>:url to the queue. |
| 14:23:38 | <@EggplantN> | Once the warrior has grabbed the item/items it then grabs a "master config" for that <ID> from either wasabi/BCDN S3 this contains ignore sets, and any other configs the warriors may require. |
| 14:23:38 | <@EggplantN> | As it grabs the page and discovers extra links it sends them to a different backfeed, this checks that we have not exceeded our time, size or url limit before sending them to the main backfeed. |
| 14:23:39 | <@EggplantN> | As the items are returned to the tracker they update the main redis size/item count column for the whole project, THEN stores a second one for the <ID>. (requires TP integration) |
| 14:23:39 | <@EggplantN> | Every 15-60 seconds the master checks the size/item key, if exceeded sets the x/min to 0 for that ID. Admins can override/increase the size/item count to resume grabbing if they wish. |
| 14:23:40 | <@EggplantN> | Once project is finished either no more items or limits are exceeded and admins MARK the job as done (decide not to up them/continue) the remaining items in redis are destroyed and any specific keys also. |
| 14:24:06 | <@EggplantN> | Complicated I know, but effectively mimics AB in a speedier, more efficient and usable way |
| 14:24:43 | | atari800 joins |
| 14:28:19 | <atari800> | Question? I'm trying to run archivebot under docker for the first time. Getting "Error loading pipeline"..."Project (whatever) did not install correctly" for everything except urlteam. Can I fix that? Googling didn't provide much I found useful |
| 14:28:43 | <@EggplantN> | Just wondering why you're trying to run ArchiveBot first atari800 |
| 14:29:15 | <atari800> | I'm sorry- I meant Warrior. |
| 14:29:32 | <@EggplantN> | Which project isn't working and which Docker Image are you using |
| 14:31:12 | <atari800> | reddit, for instance, isn't working, but also github, google sites, and so on. Using archiveteam/warrior-dockerfile:latest |
| 14:33:36 | <@arkiver> | EggplantN: |
| 14:33:37 | <atari800> | Let's see: it came from https://hub.docker.com/r/archiveteam/warrior-dockerfile/ |
| 14:33:42 | <@EggplantN> | Use atdr.meo.ws/archiveteam/warrior-dockerfile:latest |
| 14:33:48 | <@arkiver> | what is "stores a second one for the <ID>"? |
| 14:33:48 | <@EggplantN> | Kaz needs to update the DH one manually |
| 14:34:03 | <@EggplantN> | a second size/item count arkiver |
| 14:34:11 | <@EggplantN> | so we can see that jobs total |
| 14:34:38 | <@arkiver> | i think it sounds good |
| 14:34:43 | <atari800> | EggplantN2 thanks, will try that now. |
| 14:34:54 | <@arkiver> | but is this doable for now, or is this going to be too complicated to set up in a reasonable amount of time? |
| 14:35:06 | <@EggplantN> | define reasonable amount of time? |
| 14:35:06 | <@arkiver> | else we can always still go with the initial simpler idea and build upon that |
| 14:35:23 | <@arkiver> | though this is indeed a nice idea |
| 14:35:35 | <@arkiver> | reasonable amount is perhaps few weeks or a month or so? |
| 14:35:52 | <@EggplantN> | oh sure I think i just need to chat to rewby about his feelings on the design. |
| 14:35:57 | <@EggplantN> | and then have the trackerproxy mods |
| 14:36:08 | <@arkiver> | i'm not a fan of "SpeedAB" btw |
| 14:36:10 | <@EggplantN> | They aren't overly complicated TP mods I dont think |
| 14:36:14 | <@EggplantN> | well whatever we call it lol |
| 14:36:16 | <@arkiver> | but the name is a detail |
| 14:36:23 | <atari800> | "Invalid Docker Repository Url: http://atdr.meo.ws/archiveteam/warrior-dockerfile:latest" |
| 14:36:37 | <@JAA> | atari800: No http:// |
| 14:36:53 | <@EggplantN> | https://irc.fu.is/uploads/b51af014949272e6/image.png |
| 14:36:57 | <@EggplantN> | ah yes that |
| 14:37:00 | <@EggplantN> | no http:// |
| 14:37:17 | <atari800> | without the http: "Registry returned bad result." |
| 14:37:28 | <@EggplantN> | try once more? |
| 14:37:31 | <@EggplantN> | what command? |
| 14:38:27 | <@arkiver> | and on warrior getting the config, that will happen periodically, not for every item (or multi-item) it receives |
| 14:38:39 | <@arkiver> | JAA: do you have opinions on what EggplantN wrote? |
| 14:38:44 | <@EggplantN> | yeah, thats up to you, but you get the general idea |
| 14:38:50 | <@arkiver> | yep |
| 14:39:01 | <@EggplantN> | that config can also have UA's in it even etc |
| 14:39:05 | <@EggplantN> | like AB |
| 14:39:31 | <atari800> | "Invalid Docker Repository Url: atdr.meo.ws/archiveteam/warrior-dockerfile:latest" |
| 14:39:35 | <@arkiver> | sure, can have pretty much anything EggplantN |
| 14:39:44 | <@EggplantN> | thats what I was thinking |
| 14:39:56 | <@EggplantN> | what command are you running atari800 |
| 14:40:03 | <@JAA> | EggplantN: Depth = recursion level. Root = level 0, links on it = level 1, links on those = level 2, etc. AB goes to infinite depth, but wpull has an option to limit it. That's what arkiver meant by depth. |
| 14:40:15 | <@JAA> | But an overall URL count limit is also a good idea. |
| 14:40:35 | <@arkiver> | both max URL count and max depth would be good to have |
| 14:40:39 | <@JAA> | Yep |
| 14:40:39 | <@EggplantN> | yeah it would. |
| 14:41:00 | <@arkiver> | i doubt we can create a channel name for this as awesome as #// |
| 14:41:07 | <@EggplantN> | oh we can |
| 14:41:09 | <atari800> | EggplantN I'm using the docker interface on synology. Just choosing "add from URL" and pasting it in. |
| 14:41:10 | <@JAA> | But it requires extra data for each item. |
| 14:41:18 | <@EggplantN> | #// |
| 14:41:20 | <@EggplantN> | wait fuck sec |
| 14:41:24 | <@JAA> | Because a URL can be found in multiple places, and we'd only want to keep it at the lowest depth. |
| 14:41:29 | <@EggplantN> | #🍆 |
| 14:41:35 | <@JAA> | -1 |
| 14:41:38 | <@EggplantN> | EMOJI NAMES |
| 14:41:42 | <@arkiver> | yeah no |
| 14:41:44 | <@EggplantN> | lol |
| 14:41:50 | <@JAA> | #\\ ? |
| 14:41:54 | <@JAA> | :-P |
| 14:42:02 | <@JAA> | Guaranteed to not confuse anyone. |
| 14:42:02 | <@EggplantN> | #:// |
| 14:43:30 | <rewby> | #\o/ |
| 14:43:49 | <rewby> | Or #/o\ since that's how people feel when things go down |
| 14:44:09 | <rewby> | Or maybe #/!\ |
| 14:44:31 | <@JAA> | atari800: I know someone had issues before with Synology's poor integration of Docker. Don't remember the details though. |
| 14:44:36 | <atari800> | If I go to atdr.meo.ws/archiveteam/warrior-dockerfile or even atdr.meo.ws in my browser manually I just get "nope". Is that expected behavior? |
| 14:44:45 | <@JAA> | Yes |
| 14:44:49 | <atari800> | ok |
| 14:45:01 | <atari800> | JAA Hrm. :( |
| 14:45:22 | <@JAA> | Grepped my logs, don't see a solution there, just 'doesn't work'. |
| 14:46:02 | <@JAA> | Or well, no solution with the UI. They got it working through the CLI (surprise!). |
| 14:47:41 | <atari800> | Hm. thanks. I'm new to both synology and docker so this may be beyond me rn. Just thought it would be fun to run the warrior on another machine besides my Mac |
| 14:49:11 | <@JAA> | #recursionisrecursion is a bit long. :-/ |
| 14:50:03 | <@JAA> | Oh also, separate WARCs by target site please. |
| 14:50:42 | <@arkiver> | yeah that can be done with some prefix to the WARC returned by the warrior |
| 14:51:00 | <@JAA> | Yep, with changes in the factory I guess. |
| 14:51:03 | <@EggplantN> | separate warcs by target site is the hardest thing |
| 14:51:05 | <@EggplantN> | but we can do |
| 14:51:18 | <@arkiver> | hardest thing? |
| 14:51:25 | <@JAA> | Nah |
| 14:51:27 | <@arkiver> | sounds like one of the easiest parts |
| 14:51:31 | <@EggplantN> | probably, i can think of a way to do everything else but that lol |
| 14:51:33 | <@JAA> | Coming up with a good channel name is the hardest part. |
| 14:51:38 | <@arkiver> | ^ |
| 14:51:41 | <@EggplantN> | true |
| 14:52:08 | <@EggplantN> | issue with factory is we then need a way to tell it that project is done |
| 14:52:24 | <@EggplantN> | i mean, we can do that but it means more effort lol |
| 14:52:27 | <@arkiver> | well |
| 14:52:43 | <@JAA> | Unless the factory just groups together files by prefix without caring about the actual prefix value. |
| 14:52:45 | <@arkiver> | JAA: would you be fine with no separate items per project? |
| 14:53:12 | <@arkiver> | so each item on IA can contain megaWARCs from different sites |
| 14:53:46 | <@arkiver> | like how in the case of zstd, each item on IA can contain multiple megaWARCs (one for each dictionary ID) |
| 14:53:48 | <@JAA> | As long as each megaWARC only contains records from one target site, yeah, I'm fine with that. |
| 14:53:52 | <@arkiver> | yeah |
| 14:53:57 | <@arkiver> | so that is fixed EggplantN |
| 14:54:09 | <rewby> | Terrible project idea name: GottaGrabEmAll |
| 14:54:16 | <@arkiver> | we'll do it like how we have multiple megaWARCs/item for each dictionary |
| 14:54:27 | <@arkiver> | rewby: it is indeed terrible :P |
| 14:54:49 | <rewby> | If there's one thing I'm good at it's bad names and groan-inducing puns |
| 14:54:52 | <@JAA> | ARCS, short for ARCS Recursively Crawls Sites |
| 14:54:59 | <@EggplantN> | go old school |
| 14:55:00 | <@EggplantN> | DPOS |
| 14:55:07 | <rewby> | Thats a name conflict with the ARC format, JAA |
| 14:55:24 | <rewby> | I like DPOS |
| 14:55:25 | <@JAA> | Yeah, but ARC sucks. :-P |
| 14:55:35 | <@JAA> | Nah, DPoS is already the general term for our distributed projects. |
| 14:56:07 | <@EggplantN> | hrm, we could send them as the <ID> in the warc prefix. the chunker then could split them via ID, rename them based on the config from the ID (same place warrior gets it). Once its done that, the only thing we gotta do is make a way of telling the chunker to move the file size (if smaller than chunker size) to packing-queue once project is done |
| 14:56:24 | <rewby> | DGrab - Distributed Grab (Site) |
| 14:56:33 | <@arkiver> | EggplantN: i'm fine with not renaming, just using the prefix on the WARCs |
| 14:56:42 | <@EggplantN> | well that works. |
| 14:56:51 | <@arkiver> | EggplantN: we dont have to check if a project is done |
| 14:57:01 | <@arkiver> | we simply move x amount of WARCs like we do now |
| 14:57:07 | <Jake> | atari800: I think you'll need to add it to the repository list in the UI I believe? https://kb.synology.com/en-us/DSM/help/Docker/Docker?version=6 |
| 14:57:12 | <@arkiver> | then pack them into multiple megaWARCs according to their prefix |
| 14:57:20 | <@arkiver> | then upload all resulting megaWARCs to same item |
| 14:57:29 | <@arkiver> | like what we do with multiple dictionaries for zstd |
| 14:57:30 | <@JAA> | WARCS: Where ArchiveTeam Recursively Crawls Sites |
| 14:57:42 | <rewby> | JAA pls |
| 14:57:46 | <rewby> | And I thought I was bad |
| 14:57:48 | <@EggplantN> | so. we will have multiple sites in the same MegaWARC? |
| 14:57:49 | <@JAA> | lol |
| 14:58:09 | <@JAA> | You challenged me. :-P |
| 14:58:13 | <@arkiver> | EggplantN: no |
| 14:58:39 | <rewby> | GrabbyMcGrabFace |
| 14:59:10 | <@arkiver> | multiple megaWARCs (one for each 'site'/prefix), all into a single item |
| 14:59:13 | <@OrIdow6> | #:(){:|:&};: |
| 14:59:18 | <@EggplantN> | so. thats the issue. so lets say, chunker size is 5GB. project is 83GB, theres 16 megawarcs at 5GB, and then 3GB sat in the chunker. waiting to become 5GB, we need a way to tell the chunker to stop waiting for it to be 5GB and move to the packing queue. |
| 14:59:21 | <@arkiver> | total being the max size we now have per megaWARC |
| 14:59:46 | <@JAA> | OrIdow6: Heh, I was actually thinking about something like that before. #f(f()) |
| 14:59:47 | <rewby> | EggplantN: Maybe just say like "if no new data for 24 hours, consider the warc done" |
| 14:59:52 | <@arkiver> | EggplantN: the other 3 GB will sit waiting, and be processed and uploaded when another site fills it to 5 GB |
| 15:00:03 | <@EggplantN> | right, but that means that 2 sites are in the same megawarc |
| 15:00:08 | <@arkiver> | no |
| 15:00:12 | <rewby> | Yes it does? |
| 15:00:13 | <@arkiver> | multiple megaWARCs per item |
| 15:00:25 | <@EggplantN> | right now how the factory is |
| 15:00:30 | <rewby> | Egg is talking about the chunker, not the uploader |
| 15:00:32 | <@EggplantN> | you would have multiple sites in 1 megawarc |
| 15:00:33 | <@arkiver> | remember how we can have multiple megaWARCs if there's multiple dict IDs at the same time? |
| 15:00:39 | <@arkiver> | no |
| 15:00:46 | <@arkiver> | JAA: do you get what i mean? |
| 15:00:47 | <@EggplantN> | im confused. |
| 15:01:03 | <@EggplantN> | yes we can have different dict IDs in 1 megawarc, I get that |
| 15:01:11 | <@JAA> | Yeah, separate dicts for each target site, so the existing factory already does the splitting automatically. |
| 15:01:11 | <@arkiver> | no |
| 15:01:13 | <@arkiver> | we cant |
| 15:01:18 | <rewby> | Here's a project name idea: #Y - for the lambda Y combinator that does recursion in lambda calculus |
| 15:01:21 | <@arkiver> | yes like JAA says |
| 15:01:24 | <@EggplantN> | it does? |
| 15:01:26 | <@EggplantN> | uh |
| 15:01:29 | <@arkiver> | yeah |
| 15:01:37 | <@JAA> | It must, because a .warc.zst can only have one dict. |
| 15:01:44 | <@EggplantN> | *checks factory as I was not aware it did that* |
| 15:01:46 | <@arkiver> | will try to find an example hold on |
| 15:02:07 | <@OrIdow6> | JAA: Unfortunately, the # makes it a comment |
| 15:02:22 | <rewby> | #Y is free so we could use that |
| 15:02:30 | <@JAA> | OrIdow6: Yeah. I also wonder how many people would have trouble joining that. It would be glorious. |
| 15:02:37 | <@OrIdow6> | If anyone knows of a language where the channel name is an executable fork bomb... I'd like to hear it |
| 15:02:40 | <@OrIdow6> | Yeah, good point |
| 15:03:11 | <rewby> | OrIdow6: I like #Y because it's the lambda calculus Y combinator (I.e. how you do recursion in lambda calculus) |
| 15:03:23 | <jodizzle> | JAA, arkiver, Jake: FYI, the video list was incomplete (it's why I put '.1.txt' in the filename). Sorry if that was unclear. Essentially I took rewby's article list and started iterating through them, soaking up the .m3u8s. I'll have more lists to run assuming the site holds up, though I guess I should share the lists in #//. |
| 15:03:24 | <@arkiver> | EggplantN: https://archive.org/download/archiveteam_reddit_20200727111749_177315d9 |
| 15:03:45 | <@arkiver> | one megaWARC for each dict ID |
| 15:04:00 | <@arkiver> | we'll do the same for the domains project, one megaWARC for each 'site'/ID prefix |
| 15:04:04 | <@EggplantN> | I was completely unaware it did that |
| 15:04:04 | <@arkiver> | but together in the same item |
| 15:04:06 | <@JAA> | jodizzle: Ah, right, I did notice that '.1' actually. Any idea how many more? AB handled the first one just fine. |
| 15:04:17 | <@EggplantN> | me and rewby didnt even know that >_> |
| 15:04:30 | <@arkiver> | so the remaining 3 GB will be handled when another project fills it up to 5 GB (and then there'll be two megaWARCs) |
| 15:04:36 | <@arkiver> | EggplantN: well now you know :) |
| 15:04:46 | <rewby> | Yeah, that's quite interesting |
| 15:04:47 | <@arkiver> | i'll update the megaWARC factory later to handle that |
| 15:04:50 | <@EggplantN> | yeah I get you now, okay so we need only to change megawarc then. |
| 15:04:53 | <atari800> | Jake So would I add "https://atdr.meo.ws" to the registry? (Trying that I get "Failed to query registry.") |
| 15:05:27 | <Jake> | atari800: try just "atdr.meo.ws"? (I've honestly never used a Synology before) |
| 15:06:26 | <@JAA> | rewby: Why not #λ then? :-) |
| 15:06:29 | <atari800> | It wants http:// or https:// |
| 15:06:42 | <@arkiver> | i'll make sure we write to the metadata of the item which sites the item contains when we move it out of #// |
| 15:07:06 | <@JAA> | Oh wait, nevermind. |
| 15:07:32 | <jodizzle> | JAA: No, I'm not sure how many more. The first list was covering only the first ~300k articles I think. |
| 15:08:02 | <rewby> | JAA: Because I can't type that with my keyboard |
| 15:08:55 | <@JAA> | Oh, multi-item requests would also have to only return URLs for one target site. |
| 15:09:37 | <@JAA> | rewby: Weak. Ctrl+Shift+U 3bb |
| 15:10:00 | <rewby> | JAA: Doesn't work on my system. |
| 15:10:17 | <rewby> | This terminal has no unicode input support |
| 15:10:25 | <rewby> | Oh wait |
| 15:10:26 | <@JAA> | lol |
| 15:10:27 | <rewby> | Yes it does |
| 15:10:43 | <rewby> | λ It just doesn't show me it's actually doing unicode input |
| 15:10:45 | <rewby> | gg terminal |
| 15:11:05 | <rewby> | I mean, I like #Y because it's specifically the fixpoint combinator |
| 15:11:44 | <@JAA> | Yeah, that's what I realised and why I wrote 'Oh wait, nevermind.'. :-) |
| 15:11:53 | <rewby> | Ah I see |
| 15:11:59 | <rewby> | Yeah, that's the joke I was going for |
| 15:12:15 | <rewby> | It's short like #// |
| 15:12:27 | <rewby> | And the channel is free |
| 15:13:02 | <rewby> | (well, not anymore since I'm in there now. But registering it to AT is easy enough) |
| 15:13:53 | <atari800> | Is there a chance that archiveteam/warrior-dockerfile will get updated at some point so I don't need to fight with meo.ws? |
| 15:14:43 | | KRG joins |
| 15:14:43 | | KRG is now authenticated as KRG |
| 15:14:52 | <@EggplantN> | yes it will |
| 15:14:54 | | KRG` quits [Ping timeout: 250 seconds] |
| 15:14:55 | <@EggplantN> | it needs Kaz to do it |
| 15:15:00 | <@EggplantN> | or i think arkiver might be able to |
| 15:15:31 | <@JAA> | OrIdow6: You could probably build a recursive thing in C with a valid channel name as lines with # are preprocessor directives, not comments. Not sure how to actually do it though. |
| 15:16:34 | <@JAA> | Actually, maybe not because all directives require arguments with a space. :-/ |
| 15:16:43 | <@OrIdow6> | Yeah |
| 15:16:45 | <@JAA> | EggplantN: What does it need? |
| 15:16:56 | <@EggplantN> | building again |
| 15:17:03 | <@JAA> | On DH? |
| 15:17:08 | <@EggplantN> | ya |
| 15:17:13 | <rewby> | Can't you just pull the image from meo.ws, retag and push |
| 15:17:45 | | KRG quits [Remote host closed the connection] |
| 15:18:05 | | KRG joins |
| 15:18:05 | | KRG is now authenticated as KRG |
| 15:18:56 | <@JAA> | Nope, don't see how I can do that on DH's side. |
| 15:19:23 | <@JAA> | Since it's not connected to a repo, needs manual pushing it seems. |
| 15:19:23 | <rewby> | You can do it from your local docker client |
| 15:19:23 | | KRG` joins |
| 15:19:35 | <@Kaz> | hello |
| 15:19:36 | <@Kaz> | what do you need |
| 15:19:51 | <@JAA> | Update of the Docker Hub warrior image |
| 15:22:25 | | KRG quits [Ping timeout: 258 seconds] |
| 15:22:50 | <@Kaz> | the answer may be no |
| 15:22:59 | <@Kaz> | did they finally turn off builds for free accounts? |
| 15:24:03 | <@Kaz> | yeah, someone's actually gonna have to build it manually, then push it back up |
| 15:24:04 | <@Kaz> | wack |
| 15:24:08 | <@OrIdow6> | EggplantN: How does the process of ending a grab work once a size, time, etc. limit is exceeded? They stop going into the queue ("this checks that we have not exceeded our time"), but does it immediately stop it or let the queue empty? |
| 15:24:24 | <@OrIdow6> | Dictionary IDs might conceivably collide if you kept them around forever |
| 15:24:33 | <@JAA> | Kaz: Yeah, looks like it. But we have the image on atdr. Can just pull/push that, no? |
| 15:24:48 | <@Kaz> | in theory yes |
| 15:24:58 | <@Kaz> | I mean, the 'building it' bit isn't a problem anyway |
| 15:25:07 | <@Kaz> | i just don't have the right things in front of me to do it atm |
| 15:25:53 | <@JAA> | Right |
| 15:26:10 | | Arcorann quits [Ping timeout: 250 seconds] |
| 15:26:12 | | katocala quits [Remote host closed the connection] |
| 15:26:50 | <@JAA> | Hold my beer, I'm going to fuck this up. :-) |
| 15:27:13 | | rewby holds the beer |
| 15:33:33 | | katocala joins |
| 15:33:49 | | katocala is now authenticated as katocala |
| 15:34:06 | <@JAA> | Pushing is horribly slow for some reason, but it's going. |
| 15:35:49 | <@JAA> | Like 100 kB/s slow. |
| 15:40:38 | <@JAA> | atari800: I need to leave for a bit, but try archiveteam/warrior-dockerfile again in 10-15 minutes or so. Should be pushed by then. |
| 15:41:04 | <@JAA> | (Also, fuck Synology for making it so hard to use a custom registry.) |
| 15:41:12 | <atari800> | JAA thank you! |
| 15:41:14 | <rewby> | Maybe we should setup the CI to automatically push to DH as well? |
| 15:41:44 | <@JAA> | No, we should stop using DH altogether. But crap like this breaks that plan. |
| 15:58:36 | <h3ndr1k> | atari800: There are instructions on how to add custom registries. https://kb.synology.com/en-global/DSM/help/Docker/Docker?version=6#b_19 |
| 15:58:36 | <h3ndr1k> | But I don't have any synology devices either, so cannot help otherwise. |
| 15:59:33 | <atari800> | @JAA I re-loaded it, and it works now. Thank you! |
| 16:00:59 | <atari800> | h3ndr1k thanks. I did see that earlier - It didn't work for me as easily as all that. But I'll keep that URL for next time. |
| 16:07:10 | <@EggplantN> | so OrIdow6 the "ending a grab" if limits are exceeded actually just pauses it. It sets the items/min counter to 0 for that site. any "out" jobs can continue to return URLs to that queue. Admins will end the job (via a button, irc bot, command line) and this will empty the queues OR admins can up the limit if they feel |
| 16:08:49 | <@OrIdow6> | EggplantN: Oh |
| 16:09:00 | <@EggplantN> | That way we have more control. |
| 16:09:02 | <@EggplantN> | Was my idea |
| 16:09:10 | <marked> | maybe the recursive redesign discussion fits better in -dev ? |
| 16:09:16 | <@OrIdow6> | Yeah, sounds good |
| 16:09:27 | <@OrIdow6> | The queue thing, not channel thing |
| 16:09:41 | <@OrIdow6> | marked: We were discussing setting up a specific channel for it |
| 16:09:55 | <@OrIdow6> | It is better to spam up bs a bit than to split it b/t 3 channels |
| 16:09:55 | <atari800> | Woo, got two warriors running on the Synology! Thanks team. |
| 16:10:35 | <atari800> | (...running Reddit this time, not just urlteam) |
| 16:12:13 | <Jake> | I assume just with the docker hub one? |
| 16:13:11 | <atari800> | @jake yes. |
| 16:13:24 | <Jake> | ah, well glad we could get it working. |
| 16:14:29 | <@EggplantN> | My idea is to have as many features as AB has, but a ton faster/more flexible. Oridow6 if you have any design ideas do let me know |
| 16:15:27 | | Wingy0 (Wingy) joins |
| 16:15:42 | | Wingy quits [Ping timeout: 258 seconds] |
| 16:15:42 | | Wingy0 is now known as Wingy |
| 16:16:07 | <rewby> | Still vote for #Y for the channel |
| 16:17:00 | <@arkiver> | EggplantN: i'll leave that to kaz, never did it before. might have credentials somewhere |
| 16:17:32 | <@arkiver> | rewby: why? |
| 16:18:33 | <rewby> | arkiver: We were discussing the channel name for recursive grabs. And Y is the fixpoint combinator in lambda calculus. That is, it's how you do recursion in lambda calculus. On top of that it's nice and short. |
| 16:18:40 | <rewby> | And the channel is free |
| 16:18:59 | <@OrIdow6> | EggplantN: I've had thoughts on this before but haven't listed them out, I'll try to think of them |
| 16:19:04 | <@arkiver> | right |
| 16:19:24 | <@arkiver> | EggplantN: any idea when the queuing may be done? |
| 16:19:54 | | Wingy quits [Ping timeout: 250 seconds] |
| 16:20:41 | <@arkiver> | well the implementation of that |
| 16:20:58 | <@arkiver> | might want to ask fus l for input as well |
| 16:21:27 | <@OrIdow6> | rewby: Do you know if there are any contexts where it is represented w/ punctuation? Would be nice to continue the #// theme even further |
| 16:21:40 | <rewby> | Not that I know of |
| 16:21:48 | <@arkiver> | right, though lets not make thing too complicated |
| 16:22:00 | <@arkiver> | #// may have been a very nice one-off |
| 16:22:25 | <@arkiver> | things* |
| 16:22:27 | <@OrIdow6> | Well I'm in #Y right now anyhow |
| 16:22:40 | | atari800 quits [Ping timeout: 244 seconds] |
| 16:23:56 | <@arkiver> | OrIdow6: can be, still not a big fan |
| 16:26:57 | <@OrIdow6> | arkiver: Besides #recursionisrecursion ("too long" according to its suggester) and some variation on "SpeedAB", I think it's the only non-punctuation one suggested |
| 16:27:13 | <@EggplantN> | "EggplantN: any idea when the queuing may be done?" |
| 16:27:16 | <@EggplantN> | what do you mean by this |
| 16:27:49 | <@arkiver> | EggplantN: when we will be able to submit some item from the warrior through backfeed, it ending up in your queues, the queues feeding back at a certain rate to the tracker |
| 16:28:18 | <@arkiver> | (feeding back to tracker depending on how many todo items there are in the tracker, else we get a build up of URLs) |
| 16:28:23 | <@EggplantN> | oh I'd like something within a couple weeks |
| 16:28:54 | <@EggplantN> | it does need those 2 TP mods. One is already kinda used before |
| 16:29:18 | <@arkiver> | we can discuss that with Fusl |
| 16:30:13 | <@arkiver> | (unless you feel like figuring out the changes to trackerproxy) |
| 16:30:27 | <@EggplantN> | probably safer for Fus l |
| 16:31:12 | | Wingy (Wingy) joins |
| 16:53:16 | | save_fn quits [Ping timeout: 250 seconds] |
| 16:56:01 | | marked quits [Quit: The Lounge - https://thelounge.chat] |
| 16:56:45 | | marked joins |
| 17:28:10 | | Daloader joins |
| 17:33:08 | | Mateon1 quits [Ping timeout: 258 seconds] |
| 17:33:35 | | Mateon1 joins |
| 17:41:06 | <@JAA> | I'm looking into the English Apple Daily now. I can confirm that the AB jobs are incomplete due to the infinite scrolling crap. |
| 17:43:40 | <@arkiver> | can we do the same as with the hk version> |
| 17:43:41 | <@arkiver> | ? |
| 17:43:53 | <@JAA> | No, the sitemaps on en are rubbish. |
| 17:44:04 | <@JAA> | They only go back one month. |
| 17:46:07 | <@JAA> | Or you mean regarding videos etc.? |
| 17:46:54 | <@JAA> | It looks like the pagination is actually surprisingly sane and might even work in the WBM. |
| 17:46:59 | <@JAA> | Just scripty. |
| 17:47:30 | <@JAA> | I'm collecting the pagination URLs now, will run those through AB and then get the article URLs from that. |
| 17:48:12 | | Vista2003 quits [Remote host closed the connection] |
| 17:49:27 | <@JAA> | And then we can feed those into AB, #//, and whatever else. |
| 17:49:32 | <@JAA> | The more, the merrier. :-) |
| 17:52:08 | <@arkiver> | yeah duplication on this is not a problem |
| 18:08:16 | | KRG` quits [Remote host closed the connection] |
| 18:19:28 | | guest joins |
| 18:42:42 | | balrog quits [Quit: Bye] |
| 18:52:46 | | atari800 joins |
| 18:56:08 | | balrog joins |
| 18:56:08 | | balrog is now authenticated as balrog |
| 19:01:30 | <@JAA> | EggplantN: For #// because can't hurt: https://transfer.archivete.am/12W4cM/en.appledaily.com-articles.zst |
| 19:03:50 | <@EggplantN> | Do you have access to the BB host? |
| 19:03:54 | <@EggplantN> | If not gimme 4 hrs |
| 19:03:59 | <@EggplantN> | I’m getting arseholed |
| 19:04:23 | | atari800 quits [Ping timeout: 244 seconds] |
| 19:05:17 | <@JAA> | I do but have no idea how to queue things. |
| 19:05:53 | <@JAA> | It's running through AB anyway, so should be fine, but also can't hurt to grab a second copy. |
| 19:08:07 | <@JAA> | I.e. if you can't tell me how to do it because you're busy, no worries. :-) |
| 19:13:06 | <Jake> | JAA: [2021-01-18T18:17:30.181Z] <@Fus l> blackbird under /root/scripts/ [2021-01-18T18:17:35.769Z] <@Fus l> blackbird under /root/scripts/urls/ [2021-01-18T18:17:59.194Z] <@Fus l> `cat /data/urls/LIST.txt | node urls.js > /data/urls/LIST.txt.failed` |
| 19:14:27 | <Jake> | (no idea if these are the right instructions, but I remember it being talked about in my logs) |
| 19:15:06 | | KRG joins |
| 19:15:06 | | KRG is now authenticated as KRG |
| 19:15:19 | <@EggplantN> | That |
| 19:15:22 | <@JAA> | Ack |
| 19:15:24 | <@JAA> | Thanks Jake! |
| 19:15:26 | <@EggplantN> | but store big data in /data |
| 19:15:33 | <@JAA> | This is tiny, just 9k URLs. |
| 19:15:34 | <@EggplantN> | or whatever the Zfs store is |
| 19:15:38 | <@EggplantN> | oh that’s fine |
| 19:15:46 | <Jake> | np. ;) |
| 19:15:48 | <@EggplantN> | just yeet it through URLs.js |
| 19:16:02 | <@EggplantN> | it does URL validation if yours was shit :P |
| 19:16:27 | <@JAA> | grep from HTML? :-P |
| 19:17:10 | <@JAA> | jodizzle: I found a few videos on en.appledaily.com that have MP4 videoUrls rather than M3U8. Did you take that into account for hk? |
| 19:18:45 | <jodizzle> | No, I didn't. Though I also haven't found any. |
| 19:18:54 | <jodizzle> | I should probably expand my regex. |
| 19:19:50 | <@JAA> | For en, I was able to get a list of all articles with videos from the pagination because they have a triangle play icon thingy. Then my regex from the article pages was just 'videoUrl'. :-P |
| 19:21:01 | <@JAA> | en.appledaily.com-articles.zst is queued to #//. |
| 19:22:49 | | Lord_Nightmare quits [Client Quit] |
| 19:24:00 | <@JAA> | Welp, the site is dead. |
| 19:25:45 | <jodizzle> | Interesting. hk seems to have a similar triangle icon in the article listings, but there are definitely articles that have m3u8s without having that icon. |
| 19:25:50 | <jodizzle> | en is dead? |
| 19:26:01 | <@JAA> | Oh |
| 19:26:14 | <@JAA> | Ok, I'll check for that when the AB job finishes. |
| 19:26:25 | <@JAA> | Yeah, en's timing out. Guess #// was too much for it. |
| 19:27:51 | <@JAA> | There we go, it's back. |
| 19:28:40 | <jodizzle> | It's possible that the videos on the articles without the icon are generic videos as opposed to being "real" content, though. I haven't checked extensively. |
| 19:29:01 | <@JAA> | ¯\_(ツ)_/¯ Archive *ALL* the videos! |
| 19:29:04 | <jodizzle> | I haven't gotten any URL dupes in my m3u8 collecting, at least. |
| 19:29:07 | <jodizzle> | :) |
| 19:30:45 | | Lord_Nightmare (Lord_Nightmare) joins |
| 19:31:25 | <@JAA> | Oh yeah, the on-site video player (at least on en) also does some signature URL parameter stuff. Doesn't actually matter as you can access the content without the signature, but that may well break WBM playback. :-/ |
| 19:32:20 | <@JAA> | The signature is calculated in the browser though with the secret in a JS string. lol |
| 19:36:34 | <jodizzle> | That's awful. |
| 19:38:22 | | Daloader quits [Ping timeout: 250 seconds] |
| 19:43:53 | <@EggplantN> | JAA is the site still dead? |
| 19:43:55 | <@EggplantN> | if so |
| 19:44:01 | <@EggplantN> | login to tracker.at |
| 19:44:04 | <@JAA> | No, it recovered a couple minutes later. |
| 19:44:09 | <@EggplantN> | Ah okie |
| 19:44:34 | <@JAA> | Might need to check later whether they went through or it just got the #// workers banned. |
| 19:44:51 | <@EggplantN> | Well |
| 19:45:02 | <@EggplantN> | Do you have tracker ssh? |
| 19:45:27 | <@EggplantN> | if so. Please check .bash_history for a command called projectcli |
| 19:45:34 | <@EggplantN> | run the function command |
| 19:45:38 | <@EggplantN> | then run |
| 19:45:45 | <@EggplantN> | projectcli urls |
| 19:45:51 | <@EggplantN> | del urls:maxtries |
| 19:45:59 | <@EggplantN> | and I’ll figure that later |
| 19:46:06 | <@EggplantN> | it’ll stop urls getting removed from the queue |
| 19:46:14 | <@JAA> | Ah |
| 19:46:31 | <@EggplantN> | *claims |
| 19:46:34 | <@JAA> | Done |
| 19:47:15 | <@EggplantN> | Cheers. I’ll get you data later |
| 19:47:25 | <@JAA> | Thanks :-) |
| 19:49:51 | | guest quits [Ping timeout: 244 seconds] |
| 19:52:25 | <@JAA> | /*stdin*\ : 1.29% (1099021 => 14154 bytes |
| 19:52:34 | <@JAA> | How, zstd? How the fuck are you doing this? lol |
| 19:57:25 | <rewby> | EggplantN: Have you considered adding projectcli to the bashrc |
| 20:02:56 | <@EggplantN> | I need to |
| 20:02:57 | <@EggplantN> | yes |
| 20:03:11 | <@EggplantN> | I just haven’t yet |
| 20:03:29 | <@JAA> | This sounds awfully familiar... |
| 20:19:28 | <tech234a> | Need a channel name? How about #! |
| 20:20:47 | <tech234a> | Looks like everyone is in #Y already though |
| 20:26:10 | | nerdguy1138 (nerdguy1138) joins |
| 20:26:12 | | nerdguy1138 quits [Client Quit] |
| 20:28:46 | | Webuser431 is now known as SomeRando |
| 20:34:54 | | mary quits [Quit: Exiting] |
| 20:54:26 | | thuban joins |
| 21:46:04 | | Larsenv quits [Quit: ZNC 1.8.2+deb1+focal2 - https://znc.in] |
| 21:52:47 | | lorwp quits [Client Quit] |
| 21:53:27 | | lorwp (lorwp) joins |
| 21:58:47 | | lorwp quits [Client Quit] |
| 21:59:25 | | lorwp (lorwp) joins |
| 22:11:34 | <@JAA> | I can confirm that there are no other videos on en.appledaily.com than the one I found from the pagination according to the WARC from 18htae9bm7std751bvsz9qcuo. |
| 22:11:43 | <@JAA> | the ones* |
| 22:15:37 | | Ruthalas quits [Client Quit] |
| 22:22:51 | <kpcyrd> | did somebody look into sks? |
| 22:36:02 | | Ruthalas (Ruthalas) joins |
| 22:53:30 | <Jake> | what is sks? |
| 22:53:57 | <@EggplantN> | no clue |
| 23:26:51 | <@JAA> | I declare en.appledaily.com covered. Pagination, article pages, images (main images listed in the pagination, but I haven't seen any articles with more than one image), and videos have all been archived. |
| 23:32:10 | <marked> | there's some sister publications here but hard to say what's most at risk https://en.wikipedia.org/wiki/Next_Digital |
| 23:33:30 | <@JAA> | Yeah, Next Digital is the parent company of Apple Daily. |
| 23:45:09 | <@arkiver> | tech234a: i'm open to changing |
| 23:45:11 | <@arkiver> | why #! ? |
| 23:45:45 | <tech234a> | IDK it looked like a cool name, it's short, and reminds me of ! commands used in AB |
| 23:49:25 | <@JAA> | #! could be interpreted as the factorial, which is a classical toy example for implementing recursive algorithms. |
| 23:49:48 | <@JAA> | classic* |
| 23:56:22 | <@JAA> | In a similar vein: #fibonacci |
| 23:57:03 | <tech234a> | #! could also be used to portray urgency |
| 23:57:15 | <tech234a> | or surprise |