#archiveteam-bs log for 2021-06-22

00:36:32		HackMii quits [Ping timeout: 258 seconds]
00:38:26		HackMii (hacktheplanet) joins
00:40:30		HP_Archivist quits [Read error: Connection reset by peer]
00:40:57		HP_Archivist (HP_Archivist) joins
00:41:49		HP_Archivist quits [Read error: Connection reset by peer]
00:42:15		HP_Archivist (HP_Archivist) joins
00:44:41		HP_Archivist quits [Client Quit]
01:01:21		Ruthalas8 (Ruthalas) joins
01:03:22		Ruthalas quits [Ping timeout: 258 seconds]
01:03:22		Ruthalas8 is now known as Ruthalas
01:03:39		dm4v_ joins
01:04:16		dm4v quits [Ping timeout: 250 seconds]
01:04:16		dm4v_ is now known as dm4v
01:04:16		dm4v is now authenticated as dm4v
01:04:16		dm4v quits [Changing host]
01:04:16		dm4v (dm4v) joins
01:06:31		Lord_Nightmare quits [Remote host closed the connection]
01:06:43		Lord_Nightmare (Lord_Nightmare) joins
01:08:04		Ruthalas6 (Ruthalas) joins
01:08:35		Ruthalas quits [Ping timeout: 250 seconds]
01:08:36		Ruthalas6 is now known as Ruthalas
01:12:16		Lord_Nightmare2 (Lord_Nightmare) joins
01:13:10		Lord_Nightmare quits [Read error: Connection reset by peer]
01:13:10		Lord_Nightmare2 is now known as Lord_Nightmare
01:13:21		aleph quits [Ping timeout: 250 seconds]
01:13:39		aleph joins
01:17:34	<pcr>	Wikipedia has a list of HK newspapers, anything with an online version should probably be grabbed if it isn't already. https://en.wikipedia.org/wiki/List_of_newspapers_in_Hong_Kong?wprov=sfla1
01:20:24	<@JAA>	That was mentioned a few hours ago. It would be better to first focus on ones that are critical of mainland China since those are the ones at risk.
01:29:53	<pcr>	I'd agree that those ones are probably more at risk.
01:37:58		HackMii quits [Remote host closed the connection]
01:38:25		HackMii (hacktheplanet) joins
02:00:12		Ruthalas quits [Client Quit]
02:58:40		ThreeHM quits [Ping timeout: 250 seconds]
03:00:39		ThreeHM (ThreeHeadedMonkey) joins
03:39:19		lennier2 joins
03:40:09		lennier1 quits [Ping timeout: 258 seconds]
03:40:10		lennier2 is now known as lennier1
03:50:53		qw3rty__ joins
03:54:34		qw3rty_ quits [Ping timeout: 250 seconds]
04:04:06		lennier1 quits [Ping timeout: 250 seconds]
04:04:25		lennier1 (lennier1) joins
04:16:11		monoxane quits [Ping timeout: 258 seconds]
04:28:29		monoxane (monoxane) joins
04:48:09		monoxane4 (monoxane) joins
04:50:28		monoxane quits [Ping timeout: 250 seconds]
04:50:28		monoxane4 is now known as monoxane
05:15:45		sonick quits [Quit: Connection closed for inactivity]
05:19:51	<Barto>	JAA: apple daily is definitely the biggest risk
05:23:52		nerdguy1138 (nerdguy1138) joins
05:25:55		Ruthalas (Ruthalas) joins
05:27:51		nerdguy1138 quits [Client Quit]
05:45:15		Doranwen quits [Remote host closed the connection]
05:45:37		Doranwen (Doranwen) joins
06:03:08		DogsRNice quits [Read error: Connection reset by peer]
06:04:11		achivarin quits [Remote host closed the connection]
06:44:32		KRG quits [Ping timeout: 258 seconds]
07:33:25		Monotoko joins
07:33:44	<Monotoko>	Hey guys, do you have a paid account for Apple Daily?
07:34:06	<Monotoko>	if not I can provide one
07:36:29	<Ryz>	Hello Monotoko, please stick around, other people will ask you stuff regarding Apple Daily
07:36:39	<Ryz>	I think it's low activity here right now
07:36:59	<Monotoko>	Sure, I have zero experience archiving but if you're hitting paywalls or anything I'm happy to help
07:37:46	<Ryz>	JAA or arkiver or anyone involved with dealing with Apple Daily ^
07:40:36	<@OrIdow6>	J A A knows all
07:41:12	<@OrIdow6>	Thank you Monotoko, as far as we have observed, and as I know, we don't need that
07:41:47	<Ryz>	I'm curious if there's anything that can only be accessed through a paid account~
07:42:38	<@OrIdow6>	Possible
07:42:50	<@OrIdow6>	From what I've heard, both the articles and videos work fine w/o one
07:43:50	<@OrIdow6>	If there is, it would need special handling
07:44:14	<@OrIdow6>	Since it would be questionable to get paid content and make it freely available online, even if they went defunct
07:45:04	<Monotoko>	I don't believe there is, I've looked at some of the stuff I can access and it seems freely accessible - I mostly subscribed to support them back during the protests
07:45:49	<Monotoko>	at least for my account, there is "Apple Daily
07:56:59		Monotoko quits [Remote host closed the connection]
08:09:18		Zerote joins
08:12:19		Zerote__ quits [Ping timeout: 258 seconds]
08:33:36		Atom-- joins
08:34:56		Atom quits [Ping timeout: 250 seconds]
09:24:08		noteness quits [Write error: Broken pipe]
09:24:08		HackMii quits [Write error: Broken pipe]
09:24:29		HackMii (hacktheplanet) joins
09:27:03		noteness (noteness) joins
09:58:05		achivarin (achivarin) joins
10:29:12		lorwp (lorwp) joins
10:31:35		lorwp quits [Changing host]
10:31:35		lorwp (lorwp) joins
10:31:38		Matthww quits [Quit: The Lounge - https://thelounge.chat]
10:41:22		Matthww8 joins
10:50:24		LeighR (LeighR) joins
10:57:41		@chfoo quits [Remote host closed the connection]
10:57:59		chfoo (chfoo) joins
10:57:59		@ChanServ sets mode: +o chfoo
11:00:43		LeighR leaves
11:01:24		billy549 quits [Ping timeout: 250 seconds]
11:21:26		billy549 (Billy549) joins
11:32:35	<@JAA>	OrIdow6: Your wiki account is automoderated now.
11:33:57		Vista2003 joins
11:34:16	<@EggplantN>	Oridow6 I assume we’re gonna make AppleDaily a warrior project?
11:34:20	<@EggplantN>	We’ve got discovery done
11:36:16	<@EggplantN>	Hey Vista2003 so we’ve fed all the urls we’ve found of articles into our #// (urls) project. This project just grabs the page and no media etc. This was a quick effort yesterday to grab all we could while we discuss the best way to get a proper grab.
11:36:42	<Vista2003>	ok, thanks for the info
11:38:17	<@arkiver>	EggplantN: why a warrior project?
11:38:30	<@arkiver>	images?
11:38:56	<Jake>	(I think we had a AB job for the videos too?)
11:39:03	<@JAA>	Videos ran through AB last night, yeah.
11:39:13	<@EggplantN>	Yeah? Well I thought that’s what Oridow6 was up to arkiver
11:39:16	<@EggplantN>	or am I mistaken
11:39:17	<@arkiver>	and it actually for the videos as well?
11:39:56	<@EggplantN>	Either way. It’s probably time for a grab of HK media in general.
11:40:03	<@arkiver>	i mean, did AB actually get the videos as well?
11:40:14	<@JAA>	Yes, this was the .m3u8 and the .ts segments listed in them.
11:40:20	<@arkiver>	good
11:40:20	<@JAA>	jodizzle prepared that list.
11:42:45		jtagcat quits [Quit: Bye!]
11:45:49		jtagcat (jtagcat) joins
12:03:12	<@JAA>	There, created a wiki page for it collecting everything that has happened so far: https://wiki.archiveteam.org/index.php/Apple_Daily
12:06:16		KRG joins
12:06:16		KRG is now authenticated as KRG
12:07:00		KRG quits [Remote host closed the connection]
12:13:35	<rewby>	JAA: Maybe list the list of sitemap urls + urls found in sitemaps that I posted yesterday on the wiki. So we can easily find it for future steps
12:14:13		BlueMaxima quits [Read error: Connection reset by peer]
12:15:05		KRG joins
12:15:05		KRG is now authenticated as KRG
12:20:25	<@JAA>	rewby: Yup, done.
12:31:37	<@OrIdow6>	EggplantN arkiver: I wasn't working on that, just talking about it
12:31:41	<@OrIdow6>	Thank you JAA
12:32:48	<@OrIdow6>	And thank you for making this wiki page also
12:58:33	<@OrIdow6>	Anyhow, apparently GREE removes and/or makes inaccessible (I still don't understand it) some material on the 24th, not the 26th, so I need to do that
13:07:15		KRG quits [Read error: Connection reset by peer]
13:07:23		KRG` joins
13:10:08	<@arkiver>	OrIdow6: yeah i thought so
13:10:43	<@JAA>	I'm throwing a bunch of GREE stuff into AB right now.
13:11:05	<@JAA>	Just as a safety net. Chances are that'll miss a lot of it.
13:12:41	<@OrIdow6>	Thanks JAA
13:13:13	<@arkiver>	EggplantN: regarding a conversation we had some time ago, I want to move forward with a project to archive full websites through the warrior without very customized code for each
13:13:30	<@arkiver>	this could happen on #// or a different project
13:13:58	<@arkiver>	current plan is:
13:14:16	<@arkiver>	- for each website, set some config that the warriors all get
13:14:40	<@arkiver>	- this config has stuff like max depth, max time, etc. for a website
13:14:54	<@EggplantN>	So like the domains project we had for .eu
13:14:56	<@JAA>	So like #noteurdomain and #flashbang except better?
13:15:00	<@EggplantN>	Yeah
13:15:07	<@arkiver>	- a website is started by putting in an initial URL (for example the main page)
13:15:22	<@EggplantN>	Okay, I’d like to expand on that.
13:15:37	<@JAA>	Sounds nice.
13:16:03	<@arkiver>	- the warrior gets a URL from the project, which includes current depth and time; depending on whether time/depth are over the limit, the warrior will queue all newly found URLs
13:16:21	<@arkiver>	here's the tricky part that'll need EggplantN
13:17:09	<@arkiver>	we'll need special queues for websites, so the warrior queues a URL back to the project, this URL goes into a specific queue, and URLs are only released to the tracker (or handed out from the tracker) at a certain maximum rate per second
13:17:38	<@arkiver>	i believe we previously had rabbitmq queues in mind, but not sure
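The per-site queue arkiver describes above could be sketched roughly as follows. This is only an illustration, not the tracker's actual code: `SiteQueue`, `push`, `release`, and `rate_per_second` are hypothetical names, and an in-memory deque stands in for whatever backend (rabbitmq or otherwise) ends up being used.

```python
import time
from collections import deque


class SiteQueue:
    """Hypothetical per-site queue that releases URLs at a capped rate."""

    def __init__(self, rate_per_second, clock=time.monotonic):
        self.rate = rate_per_second   # maximum URLs released per second
        self.clock = clock            # injectable clock, handy for testing
        self.pending = deque()        # URLs queued back by warriors
        self.allowance = 0.0          # accumulated release credits
        self.last = clock()

    def push(self, url):
        self.pending.append(url)

    def release(self):
        """Return the URLs that may be handed to the tracker right now."""
        now = self.clock()
        # Earn credits for the elapsed time, capped at roughly one second's
        # worth so a long idle period cannot cause a burst.
        cap = max(1.0, self.rate)
        self.allowance = min(cap, self.allowance + (now - self.last) * self.rate)
        self.last = now
        out = []
        while self.pending and self.allowance >= 1.0:
            out.append(self.pending.popleft())
            self.allowance -= 1.0
        return out
```

A rate of 0 never earns credits, so the same structure also covers pausing a site without dropping what is already queued.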
13:17:41	<@EggplantN>	Okay. I see.
13:18:15	<@arkiver>	EggplantN: JAA: yes, sort of like those two
13:18:48	<@OrIdow6>	So basically a distributed AB with slightly less grab-time supervision?
13:19:07	<@JAA>	'slightly' lol
13:19:10	<@arkiver>	yes, though the config will support excludes, etc.
13:19:26	<@arkiver>	will just take a minute or so for new configs to be distributed to all warriors
13:20:11	<@EggplantN>	Okay, yes I like this. So basically a xml/Json/YAML stored in BunnyCDN. Containing URL/URLs, ignore lists, site map links if needed. Then max time, size and amount of links. That can get sent to a “master” of sorts. That will initiate the grab, get all the details and send the info to the tracker and then using some code in between it will
13:20:11	<@EggplantN>	validate URLs/limits etc and then feed to specific backfeed queues limited to x/minute releases
13:20:35	<@EggplantN>	in between of backfeed and the warriors
13:21:41	<@arkiver>	max size may not work, i was planning that the item a warrior receives is like 'DEPTH;TIME;URL'
13:21:48	<@EggplantN>	I’ve got a plan. Gimme a few hours I’ll draw it up for you.
13:22:14	<@JAA>	Time limit would be for that one URL download?
13:22:37	<@arkiver>	then the warrior adds 1 to the depth, records time (and adds to old time), and queues new items as 'DEPTH+1;TIME+DOWNLOAD_TIME;URL'
13:22:44	<@JAA>	Or additive over all depths so far?
13:22:48	<@arkiver>	additive JAA
13:23:06	<@JAA>	Right, so I guess that would be measured in seconds then.
13:23:12	<@arkiver>	for example yeah
13:23:23	<@JAA>	Unless a site is extremely slow, that is.
13:23:48	<@arkiver>	other method would be to record time somewhere, and add a flag to the config that the warrior should stop archiving the site if the time limit is broken
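The 'DEPTH;TIME;URL' item format and the additive time accounting discussed above might look like this in practice. This is a minimal sketch under assumptions: `Item`, `parse_item`, `child_items`, `max_depth`, and `max_time` are hypothetical names (the limits would come from the per-site config), and the time is assumed to be in seconds.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    depth: int    # recursion level of this URL
    time: float   # cumulative download time so far, in seconds
    url: str


def parse_item(raw: str) -> Item:
    # Split only on the first two ';' because a URL may itself contain ';'.
    depth, elapsed, url = raw.split(";", 2)
    return Item(int(depth), float(elapsed), url)


def child_items(parent: Item, download_time: float, found_urls: List[str],
                max_depth: int, max_time: float) -> List[str]:
    """Build the items to queue back as 'DEPTH+1;TIME+DOWNLOAD_TIME;URL'."""
    depth = parent.depth + 1
    elapsed = parent.time + download_time
    if depth > max_depth or elapsed > max_time:
        return []  # over a limit: queue nothing new
    return [f"{depth};{elapsed:.3f};{url}" for url in found_urls]
```

Because the depth and time travel with the item itself, a warrior can make this decision without asking a central server, which matches the "no central command server for now" plan.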
13:25:02		sonick (sonick) joins
13:25:06	<@arkiver>	if we want to limit total URLs, we'll need some central command server, but for now i was not thinking of implementing that yet, just the queues from EggplantN and release from those queues at certain rate
13:25:21	<@arkiver>	(and config files, pulled every minute or so)
13:25:29	<@arkiver>	and then we can expand on that as we go along
13:25:30		Doranwen quits [Ping timeout: 258 seconds]
13:26:29	<@EggplantN>	Yeah I know what you mean. Ive got some ideas
13:26:43	<@arkiver>	nice!
13:30:11	<@arkiver>	this project can be used for sites that we need to archive fast, or sites that are somewhat too large for AB, but too small (or too messy) for a custom warrior project
13:31:15	<@arkiver>	(appledaily is a pretty good example)
13:33:58		Doranwen (Doranwen) joins
13:39:27	<Vista2003>	I've been running the Warrior program for a bit now and I've noticed that uploads keep failing as your servers have reached the maximum amount of connections
13:41:29	<@OrIdow6>	Vista2003: If this is just within the last few hours, it's known
13:41:48	<@OrIdow6>	(Though I don't know the details)
13:43:24	<@EggplantN>	Yep BunnyCDN was offline. I need to restart some things in 30 mins
13:44:39	<Vista2003>	OrIdow6 Yep, it's in the last few hours
13:59:20	<@EggplantN>	okay fixing now Vista2003
14:07:53		ragu__ quits [Client Quit]
14:08:25		DogsRNice (Webuser299) joins
14:23:38	<@EggplantN>	So arkiver my plan was a bit more complex but nicer long term.
14:23:38	<@EggplantN>	With the effective "SpeedAB" I was planning to do a central server to manage it. effectively we would send it a domain/list of URLs similar to AB, parameters like sitemaps links, ignore sets, max size, max time, max URLs (i think you mean depth by this), offsite links y/n
14:23:38	<@EggplantN>	To do this, once the master gets the initial URL/s it sends them to a specific REDIS queue in either the urls/custom project. This queue has a limit of x/min (requires TP integration) they would likely be added as <ID>:url to the queue.
14:23:38	<@EggplantN>	Once the warrior has grabbed the item/items it then grabs a "master config" for that <ID> from either wasabi/BCDN S3 this contains ignore sets, and any other configs the warriors may require.
14:23:38	<@EggplantN>	As it grabs the page and discovers extra links it sends them to a different backfeed, this checks that we have not exceeded our time, size or url limit before sending them to the main backfeed.
14:23:39	<@EggplantN>	As the items are returned to the tracker they update the main redis size/item count column for the whole project, THEN stores a second one for the <ID>. (requires TP integration)
14:23:39	<@EggplantN>	Every 15-60 seconds the master checks the size/item key, if exceeded sets the x/min to 0 for that ID. Admins can override/increase the size/item count to resume grabbing if they wish.
14:23:40	<@EggplantN>	Once the project is finished (either no more items, or limits are exceeded and admins MARK the job as done, i.e. decide not to up them/continue), the remaining items in redis are destroyed, along with any job-specific keys.
14:24:06	<@EggplantN>	Complicated I know, but effectively mimics AB in a speedier, more efficient and usable way
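The periodic limit check in EggplantN's plan (the master looks at the size/item counters and sets the release rate to 0 for a job that is over) could be sketched like this. Plain dicts stand in for the redis keys, and `check_limits` and all key names are hypothetical, chosen only for illustration.

```python
def check_limits(jobs, counters, rates):
    """Pause (rate 0) every job whose counters exceed its configured limits.

    jobs:     {job_id: {"max_size": bytes, "max_items": count}}
    counters: {job_id: {"size": bytes, "items": count}}
    rates:    {job_id: releases per minute}, modified in place
    Returns the list of job IDs paused by this call.
    """
    paused = []
    for job_id, limits in jobs.items():
        seen = counters.get(job_id, {"size": 0, "items": 0})
        over = (seen["size"] > limits["max_size"]
                or seen["items"] > limits["max_items"])
        if over and rates.get(job_id, 0) != 0:
            rates[job_id] = 0  # pause only; queued items stay in place
            paused.append(job_id)
    return paused
```

Raising a job's limits and restoring its rate would resume the grab, which corresponds to the admin override described above.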
14:24:43		atari800 joins
14:28:19	<atari800>	Question? I'm trying to run archivebot under docker for the first time. Getting "Error loading pipeline"..."Project (whatever) did not install correctly" for everything except urlteam. Can I fix that? Googling didn't provide much I found useful
14:28:43	<@EggplantN>	Just wondering why you're trying to run ArchiveBot first atari800
14:29:15	<atari800>	I'm sorry- I meant Warrior.
14:29:32	<@EggplantN>	Which project isn't working and which Docker Image are you using
14:31:12	<atari800>	reddit, for instance, isn't working, but also github, google sites, and so on. Using archiveteam/warrior-dockerfile:latest
14:33:36	<@arkiver>	EggplantN:
14:33:37	<atari800>	Let's see: it came from https://hub.docker.com/r/archiveteam/warrior-dockerfile/
14:33:42	<@EggplantN>	Use atdr.meo.ws/archiveteam/warrior-dockerfile:latest
14:33:48	<@arkiver>	what is "stores a second one for the <ID>"?
14:33:48	<@EggplantN>	Kaz needs to update the DH one manually
14:34:03	<@EggplantN>	a second size/item count arkiver
14:34:11	<@EggplantN>	so we can see that job's total
14:34:38	<@arkiver>	i think it sounds good
14:34:43	<atari800>	EggplantN2 thanks, will try that now.
14:34:54	<@arkiver>	but is this doable for now, or is this going to be too complicated to set up in a reasonable amount of time?
14:35:06	<@EggplantN>	define reasonable amount of time?
14:35:06	<@arkiver>	else we can always still go with the initial simpler idea and build upon that
14:35:23	<@arkiver>	though this is indeed a nice idea
14:35:35	<@arkiver>	reasonable amount is perhaps few weeks or a month or so?
14:35:52	<@EggplantN>	oh sure I think i just need to chat to rewby about his feelings on the design.
14:35:57	<@EggplantN>	and then have the trackerproxy mods
14:36:08	<@arkiver>	i'm not a fan of "SpeedAB" btw
14:36:10	<@EggplantN>	They aren't overly complicated TP mods I don't think
14:36:14	<@EggplantN>	well whatever we call it lol
14:36:16	<@arkiver>	but the name is a detail
14:36:23	<atari800>	"Invalid Docker Repository Url: http://atdr.meo.ws/archiveteam/warrior-dockerfile:latest"
14:36:37	<@JAA>	atari800: No http://
14:36:53	<@EggplantN>	https://irc.fu.is/uploads/b51af014949272e6/image.png
14:36:57	<@EggplantN>	ah yes that
14:37:00	<@EggplantN>	no http://
14:37:17	<atari800>	without the http: "Registry returned bad result."
14:37:28	<@EggplantN>	try once more?
14:37:31	<@EggplantN>	what command?
14:38:27	<@arkiver>	and on warrior getting the config, that will happen periodically, not for every item (or multi-item) it receives
14:38:39	<@arkiver>	JAA: do you have opinions on what EggplantN wrote?
14:38:44	<@EggplantN>	yeah, thats up to you, but you get the general idea
14:38:50	<@arkiver>	yep
14:39:01	<@EggplantN>	that config can also have UA's in it even etc
14:39:05	<@EggplantN>	like AB
14:39:31	<atari800>	"Invalid Docker Repository Url: atdr.meo.ws/archiveteam/warrior-dockerfile:latest"
14:39:35	<@arkiver>	sure, can have pretty much anything EggplantN
14:39:44	<@EggplantN>	thats what I was thinking
14:39:56	<@EggplantN>	what command are you running atari800
14:40:03	<@JAA>	EggplantN: Depth = recursion level. Root = level 0, links on it = level 1, links on those = level 2, etc. AB goes to infinite depth, but wpull has an option to limit it. That's what arkiver meant by depth.
14:40:15	<@JAA>	But an overall URL count limit is also a good idea.
14:40:35	<@arkiver>	both max URL count and max depth would be good to have
14:40:39	<@JAA>	Yep
14:40:39	<@EggplantN>	yeah it would.
14:41:00	<@arkiver>	i doubt we can create a channel name for this as awesome as #//
14:41:07	<@EggplantN>	oh we can
14:41:09	<atari800>	EggplantN I'm using the docker interface on synology. Just choosing "add from URL" and pasting it in.
14:41:10	<@JAA>	But it requires extra data for each item.
14:41:18	<@EggplantN>	#//
14:41:20	<@EggplantN>	wait fuck sec
14:41:24	<@JAA>	Because a URL can be found in multiple places, and we'd only want to keep it at the lowest depth.
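JAA's point about extra per-item data can be illustrated with a small bookkeeping sketch: since a URL can be discovered at several recursion levels, the lowest one is the one to keep. `record_depth` and the `seen` mapping are hypothetical; a real implementation would need shared storage rather than a local dict.

```python
def record_depth(seen, url, depth):
    """Remember the lowest depth at which each URL was found.

    Returns True if the stored depth changed (new URL, or found at a
    lower depth than before), False otherwise.
    """
    previous = seen.get(url)
    if previous is None or depth < previous:
        seen[url] = depth
        return True
    return False
```

For example, a page first found at level 3 and later linked from the root page would end up recorded at level 1.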
14:41:29	<@EggplantN>	#🍆
14:41:35	<@JAA>	-1
14:41:38	<@EggplantN>	EMOJI NAMES
14:41:42	<@arkiver>	yeah no
14:41:44	<@EggplantN>	lol
14:41:50	<@JAA>	#\\ ?
14:41:54	<@JAA>	:-P
14:42:02	<@JAA>	Guaranteed to not confuse anyone.
14:42:02	<@EggplantN>	#://
14:43:30	<rewby>	#\o/
14:43:49	<rewby>	Or #/o\ since that's how people feel when things go down
14:44:09	<rewby>	Or maybe #/!\
14:44:31	<@JAA>	atari800: I know someone had issues before with Synology's poor integration of Docker. Don't remember the details though.
14:44:36	<atari800>	If I go to atdr.meo.ws/archiveteam/warrior-dockerfile or even atdr.meo.ws in my browser manually I just get "nope". Is that expected behavior?
14:44:45	<@JAA>	Yes
14:44:49	<atari800>	ok
14:45:01	<atari800>	JAA Hrm. :(
14:45:22	<@JAA>	Grepped my logs, don't see a solution there, just 'doesn't work'.
14:46:02	<@JAA>	Or well, no solution with the UI. They got it working through the CLI (surprise!).
14:47:41	<atari800>	Hm. thanks. I'm new to both synology and docker so this may be beyond me rn. Just thought it would be fun to run the warrior on another machine besides my Mac
14:49:11	<@JAA>	#recursionisrecursion is a bit long. :-/
14:50:03	<@JAA>	Oh also, separate WARCs by target site please.
14:50:42	<@arkiver>	yeah that can be done with some prefix to the WARC returned by the warrior
14:51:00	<@JAA>	Yep, with changes in the factory I guess.
14:51:03	<@EggplantN>	separate warcs by target site is the hardest thing
14:51:05	<@EggplantN>	but we can do
14:51:18	<@arkiver>	hardest thing?
14:51:25	<@JAA>	Nah
14:51:27	<@arkiver>	sounds like one of the easiest parts
14:51:31	<@EggplantN>	probably, i can think of a way to do everything else but that lol
14:51:33	<@JAA>	Coming up with a good channel name is the hardest part.
14:51:38	<@arkiver>	^
14:51:41	<@EggplantN>	true
14:52:08	<@EggplantN>	issue with factory is we then need a way to tell it that project is done
14:52:24	<@EggplantN>	i mean, we can do that but it means more effort lol
14:52:27	<@arkiver>	well
14:52:43	<@JAA>	Unless the factory just groups together files by prefix without caring about the actual prefix value.
14:52:45	<@arkiver>	JAA: would you be fine with no separate items per project?
14:53:12	<@arkiver>	so each item on IA can contain megaWARCs from different sites
14:53:46	<@arkiver>	like how in the case of zstd, each item on IA can contain multiple megaWARCs (one for each dictionary ID)
14:53:48	<@JAA>	As long as each megaWARC only contains records from one target site, yeah, I'm fine with that.
14:53:52	<@arkiver>	yeah
14:53:57	<@arkiver>	so that is fixed EggplantN
14:54:09	<rewby>	Terrible project idea name: GottaGrabEmAll
14:54:16	<@arkiver>	we'll do it like how we have multiple megaWARCs/item for each dictionary
14:54:27	<@arkiver>	rewby: it is indeed terrible :P
14:54:49	<rewby>	If there's one thing I'm good at it's bad names and groan-inducing puns
14:54:52	<@JAA>	ARCS, short for ARCS Recursively Crawls Sites
14:54:59	<@EggplantN>	go old school
14:55:00	<@EggplantN>	DPOS
14:55:07	<rewby>	Thats a name conflict with the ARC format, JAA
14:55:24	<rewby>	I like DPOS
14:55:25	<@JAA>	Yeah, but ARC sucks. :-P
14:55:35	<@JAA>	Nah, DPoS is already the general term for our distributed projects.
14:56:07	<@EggplantN>	hrm, we could send them as the <ID> in the warc prefix. the chunker then could split them via ID, rename them based on the config from the ID (same place warrior gets it). Once its done that, the only thing we gotta do is make a way of telling the chunker to move the file size (if smaller than chunker size) to packing-queue once project is done
14:56:24	<rewby>	DGrab - Distributed Grab (Site)
14:56:33	<@arkiver>	EggplantN: i'm fine with not renaming, just using the prefix on the WARCs
14:56:42	<@EggplantN>	well that works.
14:56:51	<@arkiver>	EggplantN: we dont have to check if a project is done
14:57:01	<@arkiver>	we simply move x amount of WARCs like we do now
14:57:07	<Jake>	atari800: I think you'll need to add it to the repository list in the UI I believe? https://kb.synology.com/en-us/DSM/help/Docker/Docker?version=6
14:57:12	<@arkiver>	then pack them into multiple megaWARCs according to their prefix
14:57:20	<@arkiver>	then upload all resulting megaWARCs to same item
14:57:29	<@arkiver>	like what we do with multiple dictionaries for zstd
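The packing step arkiver outlines (take a batch of WARCs, pack them into one megaWARC per prefix, upload all of them to the same item) comes down to a grouping operation. The sketch below assumes a hypothetical filename convention in which the site/ID prefix is everything before the first `-`; the real factory's naming may differ.

```python
from collections import defaultdict


def group_by_prefix(warc_names, sep="-"):
    """Group WARC filenames by the site/ID prefix before the first separator."""
    groups = defaultdict(list)
    for name in warc_names:
        prefix = name.split(sep, 1)[0]
        groups[prefix].append(name)
    return dict(groups)
```

Each resulting group would become its own megaWARC, so no megaWARC mixes records from two target sites even though they share an item.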
14:57:30	<@JAA>	WARCS: Where ArchiveTeam Recursively Crawls Sites
14:57:42	<rewby>	JAA pls
14:57:46	<rewby>	And I thought I was bad
14:57:48	<@EggplantN>	so. we will have multiple sites in the same MegaWARC?
14:57:49	<@JAA>	lol
14:58:09	<@JAA>	You challenged me. :-P
14:58:13	<@arkiver>	EggplantN: no
14:58:39	<rewby>	GrabbyMcGrabFace
14:59:10	<@arkiver>	multiple megaWARCs (one for each 'site'/prefix), all into a single item
14:59:13	<@OrIdow6>	#:(){:\|:&};:
14:59:18	<@EggplantN>	so. thats the issue. so lets say, chunker size is 5GB. project is 83GB, theres 16 megawarcs at 5GB, and then 3GB sat in the chunker. waiting to become 5GB, we need a way to tell the chunker to stop waiting for it to be 5GB and move to the packing queue.
14:59:21	<@arkiver>	total being the max size we now have per megaWARC
14:59:46	<@JAA>	OrIdow6: Heh, I was actually thinking about something like that before. #f(f())
14:59:47	<rewby>	EggplantN: Maybe just say like "if no new data for 24 hours, consider the warc done"
14:59:52	<@arkiver>	EggplantN: the other 3 GB will sit waiting, and be processed and uploaded when another site fills it to 5 GB
15:00:03	<@EggplantN>	right, but that means that 2 sites are in the same megawarc
15:00:08	<@arkiver>	no
15:00:12	<rewby>	Yes it does?
15:00:13	<@arkiver>	multiple megaWARCs per item
15:00:25	<@EggplantN>	right now how the factory is
15:00:30	<rewby>	Egg is talking about the chunker, not the uploader
15:00:32	<@EggplantN>	you would have multiple sites in 1 megawarc
15:00:33	<@arkiver>	remember how we can have multiple megaWARCs if there's multiple dict IDs at the same time?
15:00:39	<@arkiver>	no
15:00:46	<@arkiver>	JAA: do you get what i mean?
15:00:47	<@EggplantN>	im confused.
15:01:03	<@EggplantN>	yes we can have different dict IDs in 1 megawarc, I get that
15:01:11	<@JAA>	Yeah, separate dicts for each target site, so the existing factory already does the splitting automatically.
15:01:11	<@arkiver>	no
15:01:13	<@arkiver>	we cant
15:01:18	<rewby>	Here's a project name idea: #Y - for the lambda Y combinator that does recursion in lambda calculus
15:01:21	<@arkiver>	yes like JAA says
15:01:24	<@EggplantN>	it does?
15:01:26	<@EggplantN>	uh
15:01:29	<@arkiver>	yeah
15:01:37	<@JAA>	It must, because a .warc.zst can only have one dict.
15:01:44	<@EggplantN>	checks factory as I was not aware it did that
15:01:46	<@arkiver>	will try to find an example hold on
15:02:07	<@OrIdow6>	JAA: Unfortunately, the # makes it a comment
15:02:22	<rewby>	#Y is free so we could use that
15:02:30	<@JAA>	OrIdow6: Yeah. I also wonder how many people would have trouble joining that. It would be glorious.
15:02:37	<@OrIdow6>	If anyone knows of a language where the channel name is an executable fork bomb... I'd like to hear it
15:02:40	<@OrIdow6>	Yeah, good point
15:03:11	<rewby>	OrIdow6: I like #Y because it's the lambda calculus Y combinator (I.e. how you do recursion in lambda calculus)
15:03:23	<jodizzle>	JAA, arkiver, Jake: FYI, the video list was incomplete (it's why I put '.1.txt' in the filename). Sorry if that was unclear. Essentially I took rewby's article list and started iterating through them, soaking up the .m3u8s. I'll have more lists to run assuming the site holds up, though I guess I should share the lists in #//.
15:03:24	<@arkiver>	EggplantN: https://archive.org/download/archiveteam_reddit_20200727111749_177315d9
15:03:45	<@arkiver>	one megaWARC for each dict ID
15:04:00	<@arkiver>	we'll do the same for the domains project, one megaWARC for each 'site'/ID prefix
15:04:04	<@EggplantN>	I was completely unaware it did that
15:04:04	<@arkiver>	but together in the same item
15:04:06	<@JAA>	jodizzle: Ah, right, I did notice that '.1' actually. Any idea how many more? AB handled the first one just fine.
15:04:17	<@EggplantN>	me and rewby didnt even know that >_>
15:04:30	<@arkiver>	so the remaining 3 GB will be handled when another project fills it up to 5 GB (and then there'll be two megaWARCs)
15:04:36	<@arkiver>	EggplantN: well now you know :)
15:04:46	<rewby>	Yeah, that's quite interesting
15:04:47	<@arkiver>	i'll update the megaWARC factory later to handle that
15:04:50	<@EggplantN>	yeah I get you now, okay so we need only to change megawarc then.
15:04:53	<atari800>	Jake So would I add "https://atdr.meo.ws" to the registry? (Trying that I get "Failed to query registry.")
15:05:27	<Jake>	atari800: try just "atdr.meo.ws"? (I've honestly never used a Synology before)
15:06:26	<@JAA>	rewby: Why not #λ then? :-)
15:06:29	<atari800>	It wants http:// or https://
15:06:42	<@arkiver>	i'll make sure we write to the metadata of the item which sites the item contains when we move it out of #//
15:07:06	<@JAA>	Oh wait, nevermind.
15:07:32	<jodizzle>	JAA: No, I'm not sure how many more. The first list was covering only the first ~300k articles I think.
15:08:02	<rewby>	JAA: Because I can't type that with my keyboard
15:08:55	<@JAA>	Oh, multi-item requests would also have to only return URLs for one target site.
15:09:37	<@JAA>	rewby: Weak. Ctrl+Shift+U 3bb
15:10:00	<rewby>	JAA: Doesn't work on my system.
15:10:17	<rewby>	This terminal has no unicode input support
15:10:25	<rewby>	Oh wait
15:10:26	<@JAA>	lol
15:10:27	<rewby>	Yes it does
15:10:43	<rewby>	λ It just doesn't show me it's actually doing unicode input
15:10:45	<rewby>	gg terminal
15:11:05	<rewby>	I mean, I like #Y because it's specifically the fixpoint combinator
15:11:44	<@JAA>	Yeah, that's what I realised and why I wrote 'Oh wait, nevermind.'. :-)
15:11:53	<rewby>	Ah I see
15:11:59	<rewby>	Yeah, that's the joke I was going for
15:12:15	<rewby>	It's short like #//
15:12:27	<rewby>	And the channel is free
15:13:02	<rewby>	(well, not anymore since I'm in there now. But registering it to AT is easy enough)
15:13:53	<atari800>	Is there a chance that archiveteam/warrior-dockerfile will get updated at some point so I don't need to fight with meo.ws?
15:14:43		KRG joins
15:14:43		KRG is now authenticated as KRG
15:14:52	<@EggplantN>	yes it will
15:14:54		KRG` quits [Ping timeout: 250 seconds]
15:14:55	<@EggplantN>	it needs Kaz to do it
15:15:00	<@EggplantN>	or i think arkiver might be able to
15:15:31	<@JAA>	OrIdow6: You could probably build a recursive thing in C with a valid channel name as lines with # are preprocessor directives, not comments. Not sure how to actually do it though.
15:16:34	<@JAA>	Actually, maybe not because all directives require arguments with a space. :-/
15:16:43	<@OrIdow6>	Yeah
15:16:45	<@JAA>	EggplantN: What does it need?
15:16:56	<@EggplantN>	building again
15:17:03	<@JAA>	On DH?
15:17:08	<@EggplantN>	ya
15:17:13	<rewby>	Can't you just pull the image from meo.ws , retag and push
15:17:45		KRG quits [Remote host closed the connection]
15:18:05		KRG joins
15:18:05		KRG is now authenticated as KRG
15:18:56	<@JAA>	Nope, don't see how I can do that on DH's side.
15:19:23	<@JAA>	Since it's not connected to a repo, needs manual pushing it seems.
15:19:23	<rewby>	You can do it from your local docker client
15:19:23		KRG` joins
15:19:35	<@Kaz>	hello
15:19:36	<@Kaz>	what do you need
15:19:51	<@JAA>	Update of the Docker Hub warrior image
15:22:25		KRG quits [Ping timeout: 258 seconds]
15:22:50	<@Kaz>	the answer may be no
15:22:59	<@Kaz>	did they finally turn off builds for free accounts?
15:24:03	<@Kaz>	yeah, someone's actually gonna have to build it manually, then push it back up
15:24:04	<@Kaz>	wack
15:24:08	<@OrIdow6>	EggplantN: How does the process of ending a grab work once a size, time, etc. limit is exceeded? They stop going into the queue ("this checks that we have not exceeded our time"), but does it immediately stop it or let the queue empty?
15:24:24	<@OrIdow6>	Dictionary IDs might conceivably collide if you kept them around forever
15:24:33	<@JAA>	Kaz: Yeah, looks like it. But we have the image on atdr. Can just pull/push that, no?
15:24:48	<@Kaz>	in theory yes
15:24:58	<@Kaz>	I mean, the 'building it' bit isn't a problem anyway
15:25:07	<@Kaz>	i just don't have the right things in front of me to do it atm
15:25:53	<@JAA>	Right
15:26:10		Arcorann quits [Ping timeout: 250 seconds]
15:26:12		katocala quits [Remote host closed the connection]
15:26:50	<@JAA>	Hold my beer, I'm going to fuck this up. :-)
15:27:13		rewby holds the beer
15:33:33		katocala joins
15:33:49		katocala is now authenticated as katocala
15:34:06	<@JAA>	Pushing is horribly slow for some reason, but it's going.
15:35:49	<@JAA>	Like 100 kB/s slow.
15:40:38	<@JAA>	atari800: I need to leave for a bit, but try archiveteam/warrior-dockerfile again in 10-15 minutes or so. Should be pushed by then.
15:41:04	<@JAA>	(Also, fuck Synology for making it so hard to use a custom registry.)
15:41:12	<atari800>	JAA thank you!
15:41:14	<rewby>	Maybe we should setup the CI to automatically push to DH as well?
15:41:44	<@JAA>	No, we should stop using DH altogether. But crap like this breaks that plan.
15:58:36	<h3ndr1k>	atari800: There are instructions on how to add custom registries. https://kb.synology.com/en-global/DSM/help/Docker/Docker?version=6#b_19
15:58:36	<h3ndr1k>	But I don't have any synology devices either, so cannot help otherwise.
15:59:33	<atari800>	@JAA I re-loaded it, and it works now. Thank you!
16:00:59	<atari800>	h3ndr1k thanks. I did see that earlier - It didn't work for me as easily as all that. But I'll keep that URL for next time.
16:07:10	<@EggplantN>	so OrIdow6 the "ending a grab" if limits are exceeded actually just pauses it. It sets the items/min counter to 0 for that site. any "out" jobs can continue to return URLs to that queue. Admins will end the job (via a button, irc bot, command line) and this will empty the queues OR admins can up the limit if the feel
16:08:49	<@OrIdow6>	EggplantN: Oh
16:09:00	<@EggplantN>	That way we have more control.
16:09:02	<@EggplantN>	Was my idea
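The pause-instead-of-drop behaviour EggplantN describes above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `SiteQueue`, its methods, and the `max_urls` limit are hypothetical names, not the actual queue implementation, and the per-minute rate is stored but not enforced against a clock.

```python
from collections import deque


class SiteQueue:
    """Hypothetical per-site URL queue; every name here is illustrative."""

    def __init__(self, items_per_min, max_urls):
        self.items_per_min = items_per_min  # hand-out rate for this site
        self.max_urls = max_urls            # limit that triggers a pause
        self.pending = deque()
        self.seen = 0

    def add(self, url):
        # Jobs that are already out may keep returning URLs, even while paused.
        self.pending.append(url)
        self.seen += 1
        if self.seen > self.max_urls:
            self.items_per_min = 0  # pause: nothing is dropped

    def claim(self):
        # Hand out a URL only while the site is not paused.
        if self.items_per_min == 0 or not self.pending:
            return None
        return self.pending.popleft()

    def raise_limit(self, max_urls, items_per_min):
        # An admin decides the grab is fine: lift the limit and resume.
        self.max_urls = max_urls
        self.items_per_min = items_per_min

    def end(self):
        # An admin ends the job: only now is the queue actually emptied.
        self.pending.clear()
        self.items_per_min = 0


q = SiteQueue(items_per_min=60, max_urls=2)
for url in ("a", "b", "c"):
    q.add(url)
paused_claim = q.claim()          # paused, so nothing is handed out
q.raise_limit(max_urls=10, items_per_min=60)
resumed_claim = q.claim()         # resumes with the oldest queued URL
```

The point of the design is that exceeding a limit is reversible: the queued URLs survive until an admin either raises the limit or explicitly ends the job.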
16:09:10	<marked>	maybe the recursive redesign discussion fits better in -dev ?
16:09:16	<@OrIdow6>	Yeah, sounds good
16:09:27	<@OrIdow6>	The queue thing, not channel thing
16:09:41	<@OrIdow6>	marked: We were discussing setting up a specific channel for it
16:09:55	<@OrIdow6>	It is better to spam up bs a bit than to split it b/t 3 channels
16:09:55	<atari800>	Woo, got two warriors running on the Synology! Thanks team.
16:10:35	<atari800>	(...running Reddit this time, not just urlteam)
16:12:13	<Jake>	I assume just with the docker hub one?
16:13:11	<atari800>	@jake yes.
16:13:24	<Jake>	ah, well glad we could get it working.
16:14:29	<@EggplantN>	My idea is to have as many features as AB has, but a ton faster/more flexible. Oridow6 if you have any design ideas do let me know
16:15:27		Wingy0 (Wingy) joins
16:15:42		Wingy quits [Ping timeout: 258 seconds]
16:15:42		Wingy0 is now known as Wingy
16:16:07	<rewby>	Still vote for #Y for the channel
16:17:00	<@arkiver>	EggplantN: i'll leave that to kaz, never did it before. might have credentials somewhere
16:17:32	<@arkiver>	rewby: why?
16:18:33	<rewby>	arkiver: We were discussing the channel name for recursive grabs. And Y is the fixpoint combinator in lambda calculus. That is, it's how you do recursion in lambda calculus. On top of that it's nice and short.
16:18:40	<rewby>	And the channel is free
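For readers unfamiliar with the combinator rewby mentions, here is a minimal Python sketch of the idea. Because Python evaluates arguments eagerly, it uses the strict variant usually called Z rather than Y itself; the names `Z`, `fact_step`, and `factorial` are illustrative only.

```python
# Z combinator: the strict-evaluation variant of Y.
# The plain Y combinator would recurse forever in an eager language like Python.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# One "step" of factorial that never refers to itself by name;
# the recursion is supplied entirely by Z through the `self` argument.
fact_step = lambda self: lambda n: 1 if n == 0 else n * self(n - 1)

factorial = Z(fact_step)
print(factorial(5))  # 120
```

No function in the sketch calls itself by name, which is what "how you do recursion in lambda calculus" amounts to.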
16:18:59	<@OrIdow6>	EggplantN: I've had thoughts on this before but haven't listed them out, I'll try to think of them
16:19:04	<@arkiver>	right
16:19:24	<@arkiver>	EggplantN: any idea when the queuing may be done?
16:19:54		Wingy quits [Ping timeout: 250 seconds]
16:20:41	<@arkiver>	well the implementation of that
16:20:58	<@arkiver>	might want to ask fus l for input as well
16:21:27	<@OrIdow6>	rewby: Do you know if there are any contexts where it is represented w/ punctuation? Would be nice to continue the #// theme even further
16:21:40	<rewby>	Not that I know of
16:21:48	<@arkiver>	right, though lets not make thing too complicated
16:22:00	<@arkiver>	#// may have been a very nice one-off
16:22:25	<@arkiver>	things*
16:22:27	<@OrIdow6>	Well I'm in #Y right now anyhow
16:22:40		atari800 quits [Ping timeout: 244 seconds]
16:23:56	<@arkiver>	OrIdow6: can be, still not a big fan
16:26:57	<@OrIdow6>	arkiver: Besides #recursionisrecursion ("too long" according to its suggester) and some variation on "SpeedAB", I think it's the only non-punctuation one suggested
16:27:13	<@EggplantN>	"EggplantN: any idea when the queuing may be done?"
16:27:16	<@EggplantN>	what do you mean by this
16:27:49	<@arkiver>	EggplantN: when we will be able to submit some item from the warrior through backfeed, it ending up in your queues, the queues feeding back at a certain rate to the tracker
16:28:18	<@arkiver>	(feeding back to tracker depending on how many todo items there are in the tracker, else we get a build up of URLs)
16:28:23	<@EggplantN>	oh I'd like something within a couple weeks
16:28:54	<@EggplantN>	it does need those 2 TP mods. One is already kinda used before
16:29:18	<@arkiver>	we can discuss that with Fusl
16:30:13	<@arkiver>	(unless you feel like figuring out the changes to trackerproxy)
16:30:27	<@EggplantN>	probably safer for Fus l
16:31:12		Wingy (Wingy) joins
16:53:16		save_fn quits [Ping timeout: 250 seconds]
16:56:01		marked quits [Quit: The Lounge - https://thelounge.chat]
16:56:45		marked joins
17:28:10		Daloader joins
17:33:08		Mateon1 quits [Ping timeout: 258 seconds]
17:33:35		Mateon1 joins
17:41:06	<@JAA>	I'm looking into the English Apple Daily now. I can confirm that the AB jobs are incomplete due to the infinite scrolling crap.
17:43:40	<@arkiver>	can we do the same as with the hk version>
17:43:41	<@arkiver>	?
17:43:53	<@JAA>	No, the sitemaps on en are rubbish.
17:44:04	<@JAA>	They only go back one month.
17:46:07	<@JAA>	Or you mean regarding videos etc.?
17:46:54	<@JAA>	It looks like the pagination is actually surprisingly sane and might even work in the WBM.
17:46:59	<@JAA>	Just scripty.
17:47:30	<@JAA>	I'm collecting the pagination URLs now, will run those through AB and then get the article URLs from that.
17:48:12		Vista2003 quits [Remote host closed the connection]
17:49:27	<@JAA>	And then we can feed those into AB, #//, and whatever else.
17:49:32	<@JAA>	The more, the merrier. :-)
17:52:08	<@arkiver>	yeah duplication on this is not a problem
18:08:16		KRG` quits [Remote host closed the connection]
18:19:28		guest joins
18:42:42		balrog quits [Quit: Bye]
18:52:46		atari800 joins
18:56:08		balrog joins
18:56:08		balrog is now authenticated as balrog
19:01:30	<@JAA>	EggplantN: For #// because can't hurt: https://transfer.archivete.am/12W4cM/en.appledaily.com-articles.zst
19:03:50	<@EggplantN>	Do you have access to the BB host?
19:03:54	<@EggplantN>	If not gimme 4 hrs
19:03:59	<@EggplantN>	I’m getting arseholed
19:04:23		atari800 quits [Ping timeout: 244 seconds]
19:05:17	<@JAA>	I do but have no idea how to queue things.
19:05:53	<@JAA>	It's running through AB anyway, so should be fine, but also can't hurt to grab a second copy.
19:08:07	<@JAA>	I.e. if you can't tell me how to do it because you're busy, no worries. :-)
19:13:06	<Jake>	JAA: [2021-01-18T18:17:30.181Z] <@Fus l> blackbird under /root/scripts/ [2021-01-18T18:17:35.769Z] <@Fus l> blackbird under /root/scripts/urls/ [2021-01-18T18:17:59.194Z] <@Fus l> `cat /data/urls/LIST.txt | node urls.js > /data/urls/LIST.txt.failed`
19:14:27	<Jake>	(no idea if these are the right instructions, but I remember it being talked about in my logs)
19:15:06		KRG joins
19:15:06		KRG is now authenticated as KRG
19:15:19	<@EggplantN>	That
19:15:22	<@JAA>	Ack
19:15:24	<@JAA>	Thanks Jake!
19:15:26	<@EggplantN>	but store big data in /data
19:15:33	<@JAA>	This is tiny, just 9k URLs.
19:15:34	<@EggplantN>	or whatever the Zfs store is
19:15:38	<@EggplantN>	oh that’s fine
19:15:46	<Jake>	np. ;)
19:15:48	<@EggplantN>	just yeet it through URLs.js
19:16:02	<@EggplantN>	it does URL validation if yours was shit :P
19:16:27	<@JAA>	grep from HTML? :-P
19:17:10	<@JAA>	jodizzle: I found a few videos on en.appledaily.com that have MP4 videoUrls rather than M3U8. Did you take that into account for hk?
19:18:45	<jodizzle>	No, I didn't. Though I also haven't found any.
19:18:54	<jodizzle>	I should probably expand my regex.
19:19:50	<@JAA>	For en, I was able to get a list of all articles with videos from the pagination because they have a triangle play icon thingy. Then my regex from the article pages was just 'videoUrl'. :-P
19:21:01	<@JAA>	en.appledaily.com-articles.zst is queued to #//.
19:22:49		Lord_Nightmare quits [Client Quit]
19:24:00	<@JAA>	Welp, the site is dead.
19:25:45	<jodizzle>	Interesting. hk seems to have a similar triangle icon in the article listings, but there are definitely articles that have m3u8s without having that icon.
19:25:50	<jodizzle>	en is dead?
19:26:01	<@JAA>	Oh
19:26:14	<@JAA>	Ok, I'll check for that when the AB job finishes.
19:26:25	<@JAA>	Yeah, en's timing out. Guess #// was too much for it.
19:27:51	<@JAA>	There we go, it's back.
19:28:40	<jodizzle>	It's possible that the videos on the articles without the icon are generic videos as opposed to being "real" content, though. I haven't checked extensively.
19:29:01	<@JAA>	¯\_(ツ)_/¯ Archive ALL the videos!
19:29:04	<jodizzle>	I haven't gotten any URL dupes in my m3u8 collecting, at least.
19:29:07	<jodizzle>	:)
19:30:45		Lord_Nightmare (Lord_Nightmare) joins
19:31:25	<@JAA>	Oh yeah, the on-site video player (at least on en) also does some signature URL parameter stuff. Doesn't actually matter as you can access the content without the signature, but that may well break WBM playback. :-/
19:32:20	<@JAA>	The signature is calculated in the browser though with the secret in a JS string. lol
19:36:34	<jodizzle>	That's awful.
19:38:22		Daloader quits [Ping timeout: 250 seconds]
19:43:53	<@EggplantN>	JAA is the site still dead?
19:43:55	<@EggplantN>	if so
19:44:01	<@EggplantN>	login to tracker.at
19:44:04	<@JAA>	No, it recovered a couple minutes later.
19:44:09	<@EggplantN>	Ah okie
19:44:34	<@JAA>	Might need to check later whether they went through or it just got the #// workers banned.
19:44:51	<@EggplantN>	Well
19:45:02	<@EggplantN>	Do you have tracker ssh?
19:45:27	<@EggplantN>	if so. Please check .bash_history for a command called projectcli
19:45:34	<@EggplantN>	run the function command
19:45:38	<@EggplantN>	then run
19:45:45	<@EggplantN>	projectcli urls
19:45:51	<@EggplantN>	del urls:maxtries
19:45:59	<@EggplantN>	and I’ll figure that later
19:46:06	<@EggplantN>	it’ll stop urls getting removed from the queue
19:46:14	<@JAA>	Ah
19:46:31	<@EggplantN>	*claims
19:46:34	<@JAA>	Done
19:47:15	<@EggplantN>	Cheers. I’ll get you data later
19:47:25	<@JAA>	Thanks :-)
19:49:51		guest quits [Ping timeout: 244 seconds]
19:52:25	<@JAA>	/stdin\ : 1.29% (1099021 => 14154 bytes
19:52:34	<@JAA>	How, zstd? How the fuck are you doing this? lol
19:57:25	<rewby>	EggplantN: Have you considered adding projectcli to the bashrc
20:02:56	<@EggplantN>	I need to
20:02:57	<@EggplantN>	yes
20:03:11	<@EggplantN>	I just haven’t yet
20:03:29	<@JAA>	This sounds awfully familiar...
20:19:28	<tech234a>	Need a channel name? How about #!
20:20:47	<tech234a>	Looks like everyone is in #Y already though
20:26:10		nerdguy1138 (nerdguy1138) joins
20:26:12		nerdguy1138 quits [Client Quit]
20:28:46		Webuser431 is now known as SomeRando
20:34:54		mary quits [Quit: Exiting]
20:54:26		thuban joins
21:46:04		Larsenv quits [Quit: ZNC 1.8.2+deb1+focal2 - https://znc.in]
21:52:47		lorwp quits [Client Quit]
21:53:27		lorwp (lorwp) joins
21:58:47		lorwp quits [Client Quit]
21:59:25		lorwp (lorwp) joins
22:11:34	<@JAA>	I can confirm that there are no other videos on en.appledaily.com than the one I found from the pagination according to the WARC from 18htae9bm7std751bvsz9qcuo.
22:11:43	<@JAA>	the ones*
22:15:37		Ruthalas quits [Client Quit]
22:22:51	<kpcyrd>	did somebody look into sks?
22:36:02		Ruthalas (Ruthalas) joins
22:53:30	<Jake>	what is sks?
22:53:57	<@EggplantN>	no clue
23:26:51	<@JAA>	I declare en.appledaily.com covered. Pagination, article pages, images (main images listed in the pagination, but I haven't seen any articles with more than one image), and videos have all been archived.
23:32:10	<marked>	there's some sister publications here but hard to say what's most at risk https://en.wikipedia.org/wiki/Next_Digital
23:33:30	<@JAA>	Yeah, Next Digital is the parent company of Apple Daily.
23:45:09	<@arkiver>	tech234a: i'm open to changing
23:45:11	<@arkiver>	why #! ?
23:45:45	<tech234a>	IDK it looked like a cool name, it's short, and reminds me of ! commands used in AB
23:49:25	<@JAA>	#! could be interpreted as the factorial, which is a classical toy example for implementing recursive algorithms.
23:49:48	<@JAA>	classic*
23:56:22	<@JAA>	In a similar vein: #fibonacci
23:57:03	<tech234a>	#! could also be used to portray urgency
23:57:15	<tech234a>	or surprise
