00:03:49nexusxe (nexusxe) joins
00:13:08decky_e quits [Ping timeout: 252 seconds]
00:13:33decky_e (decky_e) joins
00:30:08nexusxe quits [Client Quit]
00:41:01sonick (sonick) joins
01:14:59cascode joins
01:21:02<joepie91|m>people being their usual fascist selves, as far as I can tell
01:22:24<joepie91|m>Elon found a new thing to be conspiratorial about (the Internet Archive) and his followership has latched on, with the not insignificant proportion of fascists among them having interpreted it as an invitation to think up antisemitic conspiracy theories, spread transphobic rhetoric, etc.
01:22:38<joepie91|m>this is a fairly clockwork thing with Elon unfortunately
01:23:06<joepie91|m>just this time it's IA in the crosshairs
01:24:00<JTL>Not like IA doesn't have enough problems already :x
01:25:32<joepie91|m>yeah...
01:26:13joepie91|m mumbles something about fascists torching libraries historically
01:46:00dumbgoy_ joins
01:49:23dumbgoy quits [Ping timeout: 252 seconds]
02:08:39<TheTechRobo>Love the Elon stans in the replies. People say that anyone can exclude their own content from IA, then a bunch of people ask "WELL THEN WHY WAS THE CONTENT REMOVED"
02:09:04<TheTechRobo>How have humans survived this long?
02:17:23<nicolas17>TheTechRobo: I bet the stans also said Elon was 100% right in firing the people he fired and that Twitter is best off without them
02:17:34<nicolas17>I'd like to see them justify this now https://www.businessinsider.com/elon-musk-says-twitter-probably-rehire-some-laid-off-staff-2023-5
02:22:56TheTechRobo quits [Ping timeout: 252 seconds]
02:24:02TunaLobster joins
02:24:18TheTechRobo (TheTechRobo) joins
02:24:25cascode quits [Ping timeout: 265 seconds]
02:25:52pabs quits [Ping timeout: 265 seconds]
02:26:31pabs (pabs) joins
02:27:54cascode joins
02:47:56seefoe joins
02:49:05decky_e quits [Remote host closed the connection]
02:49:38decky_e (decky_e) joins
02:50:16<seefoe>Hey all, just wanted to check on the status of the warrior tracker for the imgur project. My warrior instance has been reporting "No HTTP response received from tracker. The tracker is probably overloaded." for the past 45 mins
02:50:18cascode quits [Read error: Connection reset by peer]
02:50:39cascode joins
02:52:12<seefoe>Looks like the leader-boards aren't loading for me either
02:53:48<andrew>seefoe: the tracker being down is known, but admins are asleep
02:54:06<seefoe>ahhh, okay, glad to hear it isn't just me -- thanks andrew
03:00:02seefoe quits [Client Quit]
03:00:18seefoe joins
03:01:49<icedice><TheTechRobo> Love the Elon stans in the replies. People say that anyone can exclude their own content from IA, then a bunch of people ask "WELL THEN WHY WAS THE CONTENT REMOVED"
03:03:07<icedice>Seems on brand. These are the people who go >muh freedumb of speech whenever some community tells them to fuck off, and meanwhile they think that getting rid of Section 230, so that platforms would be liable for everything users post, is a great idea to get back at big tech
03:03:46<icedice>They don't think far enough ahead to realize that would actually mean a lot more censorship, and not of the kind that is actually warranted
03:04:25<icedice>And that they'd be the first people who'd get fucked over by it since they're the biggest assholes around that can't avoid breaking rules
03:05:10<icedice>Thinking isn't exactly their strong suit
03:05:52<icedice>And Elon the man-child will do whatever he thinks will please his new fanbase
03:06:07<andrew>it's so unbelievably stupid, with these fascist influencers twisting themselves into pretzels calling Elon's claims "true"
03:06:15<andrew>(in the replies)
03:07:26dumbgoy_ quits [Ping timeout: 265 seconds]
03:13:14<icedice>https://www.memeatlas.com/images/brainlets/brainlet-crew-colorful.jpg
03:13:30<icedice>^ Visual representation of Elon and his fanbase
03:23:07<pokechu22>It's odd that the archivebot control node and the warrior tracker for #imgone both died
03:26:34katocala quits [Ping timeout: 252 seconds]
03:27:12katocala joins
03:34:04BlueMaxima joins
04:06:33<@arkiver>Elon Musk spread a conspiracy theory that has unfortunately been around for some time.
04:06:48<@arkiver>I would like to refer to https://twitter.com/brewster_kahle/status/1659283393753006082
04:07:59<@arkiver>I also want to ask everyone to please not have political discussions here. Archive Team is not the place for that.
04:10:34cascode quits [Read error: Connection reset by peer]
04:10:42cascode joins
04:11:01cascode quits [Read error: Connection reset by peer]
04:11:23cascode joins
04:15:38eroc19901 (eroc1990) joins
04:15:41eroc1990 quits [Ping timeout: 252 seconds]
04:20:06cascode quits [Ping timeout: 252 seconds]
04:22:47cascode joins
04:36:47decky_e quits [Remote host closed the connection]
04:37:12decky_e joins
04:46:08tsblock (tsblock) joins
05:04:32archivist99 joins
05:04:38GNU_world quits [Ping timeout: 252 seconds]
05:06:30archivist99 is now known as GNU_world
05:23:20Island_ quits [Read error: Connection reset by peer]
05:30:29<seefoe>looks like it's back up
05:33:04cascode quits [Ping timeout: 252 seconds]
05:35:56archivist99 joins
05:35:57fullpwnmedia quits [Read error: Connection reset by peer]
05:36:11fullpwnmedia joins
05:36:18GNU_world quits [Ping timeout: 265 seconds]
05:51:27cascode joins
06:01:46Unholy2361 quits [Quit: Ping timeout (120 seconds)]
06:05:01archivist99 is now known as GNU_world
06:13:20hitgrr8 joins
06:43:48killsushi joins
06:44:12seefoe quits [Ping timeout: 252 seconds]
06:45:07seefoe joins
06:54:55BlueMaxima quits [Client Quit]
06:55:12cascode quits [Ping timeout: 252 seconds]
06:55:46cascode joins
07:13:32cascode quits [Ping timeout: 252 seconds]
07:14:31spirit quits [Client Quit]
08:12:20<Ryz>Huh...? Just the English and Chinese language versions of Niconico are going to be discontinued at the end of June 2023? https://blog.nicovideo.jp/niconews/192660.html
08:12:30<Ryz>Might need investigating if there's anything more than that ><;
08:14:02<Exorcism|m>whaaaat, but what is the point of removing translations 😭
08:36:58spirit joins
08:51:03decky_e quits [Read error: Connection reset by peer]
08:57:13decky_e (decky_e) joins
09:23:29Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
09:23:42Minkafighter joins
09:54:52umgr036 joins
09:55:43umgr036 quits [Remote host closed the connection]
09:55:56umgr036 joins
10:10:39user__ joins
10:13:15umgr036 quits [Ping timeout: 265 seconds]
10:34:23tsblock quits [Read error: Connection reset by peer]
10:36:57decky_e quits [Remote host closed the connection]
10:40:45s-crypt quits [Remote host closed the connection]
10:40:45flashfire42 quits [Remote host closed the connection]
10:40:45kiska quits [Remote host closed the connection]
10:40:45Ryz2 quits [Remote host closed the connection]
10:41:04Ryz2 (Ryz) joins
10:41:06s-crypt (s-crypt) joins
10:41:09flashfire42 (flashfire42) joins
10:42:18kiska (kiska) joins
10:57:38BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
11:02:06<icedice>Right, sorry arkiver
11:02:56<icedice>Musk just annoys the hell out of me, especially when going after organizations like Internet Archive that just try to make things better
11:03:13<icedice>I'll leave it at that
11:08:14BearFortress joins
11:09:33<icedice>Does the Imgur API still allow grabbing 12,500 downloads per day and 500 downloads per hour, or have they limited that recently?
11:14:10tbc1887_ quits [Client Quit]
11:14:30tbc1887 (tbc1887) joins
11:22:47<@OrIdow6>Anyone looked into Splice Studio (on DW, the 31st)? I think that section of the website is login-restricted but I'm not entirely sure
11:42:11<h2ibot>TheTechRobo edited Strawpoll.me (+19, Link to my uploaded data): https://wiki.archiveteam.org/?diff=49804&oldid=49414
12:21:29spirit quits [Client Quit]
12:39:19sonick quits [Client Quit]
13:24:48<Thibaultmol>so which warrior/grab projects are currently actively getting jobs?
13:24:48<Thibaultmol>imgur, telegram, reddit, urls
13:24:48<Thibaultmol>Am I missing any?
13:29:29lennier1 quits [Ping timeout: 265 seconds]
13:30:33lennier1 (lennier1) joins
13:48:04Unholy2361 (Unholy2361) joins
13:49:35umgr036 joins
13:49:53user__ quits [Ping timeout: 252 seconds]
14:03:02<nstrom|m>urlteam is kinda always active but probably has way more workers than work
14:08:15<icedice>What's going on with the ArchiveBot tracker being down?
14:41:14Arcorann quits [Ping timeout: 252 seconds]
15:04:03spirit joins
15:23:07<@kaz>not a lot
15:23:13<@kaz>is down, will be back in the future no doubt
15:24:06Unholy23614 (Unholy2361) joins
15:27:47Unholy2361 quits [Ping timeout: 252 seconds]
15:27:47Unholy23614 is now known as Unholy2361
15:31:19Island joins
15:58:10BigBrain_ (bigbrain) joins
15:58:21BigBrain quits [Ping timeout: 245 seconds]
16:06:02<icedice>Ok
16:06:03icedice quits [Client Quit]
16:28:41eroc19901 is now known as eroc1990
16:45:53jspiros quits [Ping timeout: 252 seconds]
16:46:47jspiros (jspiros) joins
17:01:01dumbgoy_ joins
17:31:46decky_e (decky_e) joins
17:35:50tbc1887 quits [Read error: Connection reset by peer]
17:43:41<mikolaj|m>I've uploaded forum-dl v0.1.0 to PyPI (can be installed with pip install forum-dl now)
17:44:58<mikolaj|m>if anyone's interested in testing it and leaving a comment, feature request, or a bug report -- I'd appreciate that greatly
17:45:08<mikolaj|m>the repository is here, of course: https://github.com/mikwielgus/forum-dl
17:45:45<mikolaj|m>it cannot dump WARCs yet, unfortunately; that's a priority for v0.2.0
17:46:47that_lurker quits [Client Quit]
17:47:40that_lurker (that_lurker) joins
17:48:58decky_e quits [Ping timeout: 252 seconds]
17:49:23decky_e (decky_e) joins
18:10:39Miori joins
19:11:01HP_Archivist (HP_Archivist) joins
19:12:39decky_e quits [Ping timeout: 265 seconds]
19:18:57decky_e (decky_e) joins
19:38:34<andrew>is there any way to have grab-site re-crawl some section of a site without re-downloading other pages that have already been downloaded?
19:43:43<@JAA>andrew: Depends on your definition of 'section'. If it's a certain path, the easiest way would just be to run a separate crawl starting from that. If it's less easily separated, a full crawl with broad ignores might work. You could also mess with the wpull DB to mark URLs as todo again and then resume the crawl, but you might want to do that in a separate dir with a copy of the data.
19:46:24<andrew>JAA: wait, you can resume a crawl?
19:47:45<@JAA>Well, sort of, but not properly.
19:48:46<@JAA>You can manually run wpull (using the command generated by grab-site with some option), but it has problems. Cookies will be blank for the restart, for one.
19:49:42<@JAA>Not sure if grab-site makes wpull write out cookies when it finishes normally nor what happens if you stop a crawl.
19:49:59<@JAA>Suffice to say that it's a messy process.
19:59:16<andrew>JAA: how badly will grab-site barf if I give it a 50 million line ignore list?
20:00:26<@JAA>andrew: Not sure if re2 has any pattern size limits, but other than that, you'd just be limited by RAM size I guess. I bet the performance would be abysmal though.
20:00:34<andrew>or would it be better to start a crawl then import the URL list into wpull.db
20:03:05<@JAA>I suppose that would be another option, yeah.
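[ed.: JAA's suggestion above — messing with the wpull DB to mark URLs as todo again — could look roughly like the sketch below. This is a hedged illustration, not wpull's documented API: the `urls` table and the `status` values are assumptions about wpull's SQLite schema, which varies between versions; inspect your own wpull.db with `.schema` first, and only ever touch a copy while the crawl is stopped (as the traceback further down demonstrates the hard way).]

```python
import sqlite3

def requeue_section(con: sqlite3.Connection, prefix: str) -> int:
    """Mark every already-fetched URL under `prefix` as 'todo' again.

    NOTE: the `urls` table and the 'done'/'todo' status values are
    assumptions about wpull's SQLite schema, not a guaranteed interface.
    Run this only against a COPY of wpull.db while the crawl is stopped.
    """
    with con:  # commit on success, roll back on error
        cur = con.execute(
            "UPDATE urls SET status = 'todo' "
            "WHERE status = 'done' AND url LIKE ? || '%'",
            (prefix,),
        )
    return cur.rowcount

# Toy demo against an in-memory DB using the assumed schema:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT)")
con.executemany("INSERT INTO urls VALUES (?, ?)", [
    ("https://example.com/section/a", "done"),
    ("https://example.com/section/b", "done"),
    ("https://example.com/other/c", "done"),
])
print(requeue_section(con, "https://example.com/section/"))  # prints 2
```

After the update, resuming the crawl (with the caveats JAA mentions about cookies) should make wpull revisit only the requeued section.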
20:04:25<andrew>ERROR Fatal exception.
20:04:25<andrew>Traceback (most recent call last):
20:04:25<andrew> File "/nix/store/sgpdf09ql5ik9zvdw8dgxplqgp5k02h9-python3.8-SQLAlchemy-1.3.24/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
20:04:25<andrew> self.dialect.do_execute(
20:04:25<andrew> File "/nix/store/sgpdf09ql5ik9zvdw8dgxplqgp5k02h9-python3.8-SQLAlchemy-1.3.24/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
20:04:26<andrew> cursor.execute(statement, parameters)
20:04:26<andrew>sqlite3.DatabaseError: database disk image is malformed
20:04:42<andrew>oops, I probably shouldn't have tried poking around the database while it was running
20:05:44umgr036 quits [Remote host closed the connection]
20:05:58umgr036 joins
20:06:13<andrew>well shit, what do I do now to resume this crawl? :P
20:13:19<@JAA>RIP
20:13:51<andrew>welp, I guess it's time to see what re2 can do :P
20:14:01<andrew>WCGW?
20:59:15<andrew>update: it seems re2 does not like my 754k line ignores file, it has printed "/build/source/re2/simplify.cc:225: CoalesceWalker::ShortVisit called" to the console thousands of times
21:00:07<@JAA><surprised_pikachu.png>
21:00:35<andrew>/ Should never be called: we use Walk(), not WalkExponential().
21:00:39<andrew>well that's comforting
21:00:42<@JAA>Yeah, just saw that as well. lol
21:00:47Unholy2361 quits [Client Quit]
21:01:08Unholy2361 (Unholy2361) joins
21:04:34<andrew>great, now Sublime Text (UNREGISTERED) is not responding
21:07:48<joepie91|m>arkiver: honestly, I don't think that "don't have political discussions" is the right policy here. what Archive Team does is fundamentally political, and political circumstances affect both Archive Team and the Internet Archive (as is obvious from this case of Musk targeting IA, for example). this has also historically been true, with eg. institutions of knowledge and information often being the first to go under authoritarian rule. I can understand that certain rhetoric, behaviour or viewpoints may not be welcome here, but "no political discussions" seems to me an extremely poor capture of that idea, and one that indirectly makes it impossible to discuss the very real threats that affect not just IA but also the people here.
21:08:21<joepie91|m>if there's certain rhetoric/views/etc. that are not wanted here, then I would suggest calling those by their name instead
21:08:48<joepie91|m>(I'm not opposed to that, to be clear - just arguing that trying to phrase it as "no politics" has unwanted collateral effects)
21:12:58<fireonlive>754k!
21:13:17<fireonlive>i feel better about my 42 rules now
21:13:29<andrew>surprisingly, Notepad++ is handling this much more gracefully
21:16:22<masterX244>when you stomp onto unexpected bugs due to oversized data
21:17:08<andrew>Notepad++ has some graphical glitches with the length of these lines :P
21:18:32<andrew>keystroke latency is a perfectly fine 7 seconds
21:18:38TunaLobster quits [Read error: Connection reset by peer]
21:19:44<andrew>okay, I optimized the regex down to 30 megabytes
21:19:49<andrew>let's see if it works
21:20:36<andrew>okay, turns out terminals don't like it when you print 18 megabytes on a single line
21:20:52<CrispyAlice2>who woulda thought
21:21:10<threedeeitguy>OoO
21:22:37<fireonlive>mikolaj|m: very interesting, thank you!
21:22:37<andrew>grab-site quickly hit 3.1 GiB of memory
21:22:52<andrew>oh come on - /build/source/re2/simplify.cc:225: CoalesceWalker::ShortVisit called
21:22:56<masterX244>seems like the bot is still munching
21:23:20<masterX244>wrong chat, shit
21:24:19<andrew>hey maybe it worked, wpull.db is filling up
21:27:10<joepie91|m>hm, oops, apparently the aforementioned topic had moved to -ot and I didn't notice
21:28:20Billy549 (Billy549) joins
21:35:39seefoe quits [Client Quit]
21:48:43seefoe joins
22:01:38<andrew>okay, wow, this regex performs so poorly the HTTP requests seem to be timing out (?)
22:02:09<andrew>this is not good
22:03:15<spirit>maybe there are some tools to build more efficient expressions from a big corpus like that
22:03:24<spirit>if there is any chance :D
22:03:28<@JAA>I was wondering how long it would take to match even a single URL.
22:03:30<andrew>oh, what if I passed --exclude-directories to wpull?
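[ed.: when an "ignore list" is really hundreds of thousands of literal URLs or directory prefixes rather than true patterns, a hash-set lookup with a walk up the path is far cheaper than compiling one giant alternation into re2. The sketch below only illustrates the idea standalone; it is not grab-site's actual ignore mechanism, and wiring it into grab-site would mean patching it, which is not shown.]

```python
# Illustrative alternative to a 754k-alternative regex: store literal
# ignore entries in a set and check a URL and its parent directories.
from urllib.parse import urlsplit

def build_ignores(lines):
    """Load literal ignore entries (full URLs or dir prefixes) into a set."""
    return {line.strip() for line in lines if line.strip()}

def is_ignored(url: str, ignores: set) -> bool:
    """True if the URL, or any parent directory of it, is in the ignore set."""
    if url in ignores:
        return True
    parts = urlsplit(url)
    path = parts.path
    # Walk up the directory tree: /a/b/c -> /a/b/ -> /a/ -> /
    while "/" in path.rstrip("/"):
        path = path.rstrip("/").rsplit("/", 1)[0] + "/"
        if f"{parts.scheme}://{parts.netloc}{path}" in ignores:
            return True
        if path == "/":
            break
    return False

ignores = build_ignores(["https://example.com/junk/"])
print(is_ignored("https://example.com/junk/page/1", ignores))  # True
print(is_ignored("https://example.com/keep/page/1", ignores))  # False
```

Each lookup is O(path depth) set membership tests, independent of how many entries the ignore list has, which is why this scales where a monolithic regex does not.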
22:19:48<fireonlive>are transfer.archivete.am files expired after a while?
22:20:40seefoe is now known as rohvani
22:21:13<@JAA>fireonlive: No
22:21:24<fireonlive>kk :)
22:21:45<fireonlive>this is archiveteam afterall i suppose haha
22:22:03<masterX244>we don't advertise it to avoid abuse
22:22:45<@JAA>Yeah, the homepage also mentions a size limit that doesn't exist.
22:23:32<imer>i call bs on that one JAA, i may have accidentally tried to upload non-zstd files and it timed out :p
22:24:15<imer>although that might be peering limitations, can't seem to get it to upload faster than ~50mbit/s
22:24:17<pokechu22>oh, yeah, the front page claims "Files stored for 1 day" which is definitely not the case
22:24:46<pokechu22>I also like the "contact us" link that goes nowhere
22:24:50<@JAA>imer: Yeah, but that's a bug with the CDN, not an actual restriction.
22:25:11<imer>well, same difference
22:25:35<@JAA>There is a workaround that allows uploading arbitrarily large files. It's just not advertised publicly.
22:25:42<imer>probably a good thing
22:33:05hitgrr8 quits [Client Quit]
22:34:49<fireonlive>makes sense
22:37:10killsushi quits [Ping timeout: 252 seconds]
23:04:09BlueMaxima joins
23:24:18<fullpwnmedia>nicolas17 is there actually a random video on the server?
23:25:15<fullpwnmedia>JAA how'd you go with the pulling from the dynabook ftp?
23:26:37<@JAA>fullpwnmedia: I just threw it into ArchiveBot. It should be in the Wayback Machine by now.
23:26:53<fullpwnmedia>all of it?
23:27:13<nicolas17>fullpwnmedia: Upload/74580960.mp4
23:27:22<nicolas17>added may 16
23:27:25<@JAA>Is something missing?
23:27:37<nicolas17>it's the only change since I first mirrored it
23:28:04<fullpwnmedia>JAA it should be good. thanks for that
23:28:16<fullpwnmedia>nicolas17 whats the ftp url again?
23:28:38<nicolas17>https://uk.dynabook.com/generic/general-new-ftp-and-software-guide-sheets/
23:29:10<nicolas17>here's the video https://transfer.archivete.am/LMo8u/74580960.mp4
23:29:47<fullpwnmedia>istg if its some nsfw shit
23:30:00<fullpwnmedia>nvm lmao
23:30:18<fullpwnmedia>thats the most random video ive seen in a while
23:30:30<nicolas17>it's actually a dynabook laptop tho
23:30:48<fullpwnmedia>it is but its funny that someone just put it there
23:31:08<fullpwnmedia>i wonder if it was done by support
23:33:33<fullpwnmedia>JAA sorry, how do i access the files on the wayback like the directory list? do i just put the same user and password on the prompt on wayback?
23:35:06<@JAA>fullpwnmedia: You can't really access the dir listing. Username and password are ignored by the WBM (as is the protocol, actually). But https://web.archive.org/web/20230507*/ftp://www.tb2b.eu/* should help.
23:35:39<fullpwnmedia>gotcha gotcha. so all i need is just the file path
23:36:28<@JAA>The dir listing would in theory be available at https://web.archive.org/web/20230507035244/ftp://www.tb2b.eu/ but the WBM forces it to a download instead of displaying the contents.
23:36:37<nicolas17>I have 6973 files+directories here
23:36:57<fullpwnmedia>fair enough
23:36:59<nicolas17>that WBM page says there are 6977 captured URLs so it should be complete, but I'm left wondering what the 5 extra are
23:37:07<nicolas17>er 4 extra I can't math today
23:37:16<pokechu22>view-source:https://web.archive.org/web/20230507035244id_/ftp://www.tb2b.eu/
23:37:52<fullpwnmedia>nicolas17 could it be the html for the directory listings?
23:38:04<pokechu22>at least in firefox, that bypasses it being downloaded
23:38:07<@JAA>Oh right, view-source bypasses the download forcing.
23:38:20<nicolas17>well I'm counting directories too
23:38:28<@JAA>More accurately, the WBM doesn't actually force anything, it just serves it with 'content-type: application/octet-stream'.
23:38:33<andrew>so update on the grab-site exclusion thing - I found the source code that handles --exclude-directories and it looks prohibitively slow for the number of exclusions I have
23:38:37<@JAA>And the browser generally won't know what to do with that.
23:38:48<pokechu22>but you also need to change the url to specify id_ or if_ because otherwise you just get the source for the WBM outer frame: view-source:https://web.archive.org/web/20230507035245/ftp://www.tb2b.eu/Deployment_Files/
23:38:59<fullpwnmedia>youre a real one for that
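[ed.: the id_ trick above is easy to script: inserting `id_` after the timestamp in a Wayback Machine URL asks for the raw capture bytes instead of the framed replay page, which also works for FTP captures like the one discussed here. A minimal helper (the example URL is the dir listing from this conversation):]

```python
# Build a Wayback Machine URL that returns the raw capture bytes
# ("id_" replay modifier) rather than the WBM-framed page.
def wbm_raw_url(timestamp: str, original: str) -> str:
    return f"https://web.archive.org/web/{timestamp}id_/{original}"

print(wbm_raw_url("20230507035244", "ftp://www.tb2b.eu/"))
# https://web.archive.org/web/20230507035244id_/ftp://www.tb2b.eu/
```

Fetching that URL with any HTTP client then yields the stored directory listing directly, no view-source workaround needed.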
23:49:02Nulo_ joins
23:51:44Nulo_ quits [Read error: Connection reset by peer]
23:51:49Nulo_ joins
23:52:01Nulo quits [Ping timeout: 265 seconds]
23:52:02Nulo_ is now known as Nulo