#archiveteam-bs log for 2022-12-02


00:02:54		NickNick joins
00:20:13		sec^nd quits [Remote host closed the connection]
00:21:57		sec^nd (second) joins
00:39:37		Junior joins
00:40:13		Junior quits [Remote host closed the connection]
00:48:36		HackMii quits [Remote host closed the connection]
00:49:36		HackMii (hacktheplanet) joins
00:55:11		sec^nd quits [Ping timeout: 248 seconds]
00:56:06		BlueMaxima joins
00:58:35		sec^nd (second) joins
01:25:35		mut4ntm0nkey quits [Ping timeout: 248 seconds]
01:26:29		jacobk quits [Ping timeout: 265 seconds]
01:37:15		jacobk joins
01:39:19		mut4ntm0nkey (mutantmonkey) joins
01:57:51		michaelblob quits [Client Quit]
01:59:22		FalconK_ quits [Quit: WeeChat 3.7.1]
02:00:02		FalconK (FalconK) joins
02:06:23		negativegray joins
02:07:37	<negativegray>	Hi! I'm looking for a specific fanfic but don't really have the disk space or the internet speed to comb the WARC batches for it, is this the proper channel to ask if someone has it in an archive?
02:12:21		NickNick quits [Client Quit]
02:23:57		qwertyasdfuiopghjkl quits [Client Quit]
02:34:07		onetruth quits [Client Quit]
02:39:26		michaelblob (michaelblob) joins
02:48:57	<TheTechRobo>	negativegray: which archive are you looking at?
02:49:10	<TheTechRobo>	most official archiveteam stuff is in the wayback machine
02:56:19	<negativegray>	I don't know how to effectively search there, I'm looking at these: https://archive.org/details/archiveteam_fanfiction
02:58:38	<TheTechRobo>	From the collection info (https://archive.org/details/archiveteam_fanfiction?tab=about): "Fanfiction.net Safety Download <https://archive.org/details/fanfic_download_2012_01> is a single 2 GB tar file containing epub files, which may be easier to extract."
03:02:55		sonick quits [Client Quit]
03:02:56	<TheTechRobo>	negativegray: ^
03:03:37	<negativegray>	I've checked that, it doesn't have the complete thing I'm pretty sure
03:04:40	<negativegray>	TheTechRobo: or I'm very bad at searching through it
03:07:02	<TheTechRobo>	negativegray: Assuming it's not in the Wayback Machine (that grab was long before I joined AT so I don't know), you can look through the items' CDX
03:07:10	<TheTechRobo>	For example, for https://archive.org/download/archiveteam-fanfiction-warc-07, there are several cdx.gz files
03:07:20	<TheTechRobo>	Not sure which is the "correct" one but they're a lot smaller than the WARC
03:07:33	<TheTechRobo>	They basically list the WARC's contents, e.g. urls, capture time iirc
03:08:06	<TheTechRobo>	they also list which WARC contains the resource
03:08:34	<negativegray>	TheTechRobo: oooh, thank you! How do I open a cdx?
03:09:10	<TheTechRobo>	negativegray: It's just text, and there's plenty of documentation.
03:09:13	<TheTechRobo>	let me see if I can find some.
03:10:27	<TheTechRobo>	negativegray: Here you go! The first line of CDX is the legend, and it has letters that correspond to what the value is representing. I think it's space separated.
03:10:31	<TheTechRobo>	Here's the letter list: https://archive.org/web/researcher/cdx_legend.php
03:10:48	<negativegray>	TheTechRobo: thank you!
03:10:49	<TheTechRobo>	Not all letters will be present.
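A minimal sketch of the CDX lookup described above, assuming the file starts with a space-separated legend line such as " CDX N b a m s k r M S V g" (per the legend page, "a" is the original URL, "b" the capture date, and "g" the name of the file holding the record). The function name `search_cdx` and the sample rows are made up for illustration. It matches on the URL only, so it cannot do a full-text search by title or author.

```python
import gzip


def search_cdx(lines, needle):
    """Yield one dict per CDX row whose original URL (field "a") contains needle."""
    fields = None
    needle = needle.lower()
    for line in lines:
        line = line.rstrip("\r\n")
        if fields is None:
            # The first line is the legend: "CDX" followed by one letter per column.
            if not line.lstrip().startswith("CDX"):
                raise ValueError("expected a CDX legend on the first line")
            fields = line.split()[1:]
            continue
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip blank or malformed rows
        record = dict(zip(fields, values))
        if needle in record.get("a", "").lower():
            yield record


# Made-up sample rows; a real file would be read with
# gzip.open(path, "rt", encoding="utf-8", errors="replace").
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "org,example)/s/123/1 20120101000000 https://example.org/s/123/1/ text/html 200 AAAA - - 1234 5678 example-01.warc.gz",
    "org,example)/s/456/1 20120101000001 https://example.org/s/456/1/ text/html 200 BBBB - - 2345 6789 example-02.warc.gz",
]

for rec in search_cdx(SAMPLE, "/s/123/"):
    print(rec["b"], rec["a"], rec["g"])
```

Because the matching row names the containing file, only that one WARC would need to be downloaded rather than the whole batch.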
03:12:12	<negativegray>	TheTechRobo: I tried reading the .cdx and it did not help me, even with the legend
03:12:34	<TheTechRobo>	Hang on let me download it my internet is slow
03:12:38	<TheTechRobo>	I may have to go before it finishes
03:13:46	<negativegray>	okay!
03:15:06	<TheTechRobo>	negativegray: while it downloads, what information do you have about the fanfic?
03:15:16	<TheTechRobo>	do you have the URL? or do you need a full-text search?
03:15:32	<TheTechRobo>	if the latter, CDX won't work for you
03:18:15	<negativegray>	TheTechRobo: yeah I need a full text search. I have the author's name and the fanfic's name
03:18:18	<negativegray>	or title
03:18:28	<TheTechRobo>	In that case, yeah, CDX probably won't help you. :/
03:18:36	<TheTechRobo>	Unless the url contains the title or something.
03:19:12	<negativegray>	yeah
03:19:15	<negativegray>	ty though
03:20:11	<TheTechRobo>	I'm not sure what you can do in that case. Does anybody have the fanfic warcs downloaded?
03:20:18	<TheTechRobo>	I have to go to bed btw, good night!
03:20:38	<negativegray>	good night!
03:23:32	<Doranwen>	negativegray: out of curiosity, what fandom?
03:23:38	<negativegray>	Harry Potter
03:23:58	<Doranwen>	Ah, I wouldn't have it. Could ask some friends of mine, though.
03:24:24	<Doranwen>	We've got a Discord server where we share info on deleted fics we have.
03:25:52	<negativegray>	oh!
03:25:57	<negativegray>	That'd be great!
03:26:06	<negativegray>	It is in Portuguese, though
04:04:22	<negativegray>	okay! I got a URL for the author and the fic!
04:06:07	<negativegray>	I can only access the first chapter, though
04:11:10		eroc1990 quits [Remote host closed the connection]
04:11:35		eroc1990 (eroc1990) joins
04:19:02		march_happy quits [Ping timeout: 268 seconds]
04:19:20		march_happy (march_happy) joins
04:20:35	<negativegray>	gods, being so close hurts. I managed to get to the wayback machine page of the first chapter, but it seems to be the only one that there is on cache
05:13:06		pabs quits [Ping timeout: 276 seconds]
05:22:08		pabs (pabs) joins
05:30:20		negativegray quits [Remote host closed the connection]
06:50:48		pabs quits [Ping timeout: 265 seconds]
06:52:24		Island quits [Read error: Connection reset by peer]
06:53:42		pabs (pabs) joins
07:20:14		hitgrr8 joins
07:34:52		sonick (sonick) joins
08:00:41		BlueMaxima quits [Read error: Connection reset by peer]
10:01:15		DiscantX joins
10:07:15		Doomaholic joins
10:10:23		DiscantX quits [Client Quit]
10:11:06		DiscantX joins
10:15:37	<schwarzkatz|m>	Can you share a link,
10:16:41	<schwarzkatz|m>	Damn it, didn’t mean to send so early.
10:16:41	<schwarzkatz|m>	negativegray: can you share a link please?
10:29:06	<h2ibot>	Arkiver uploaded File:Buzzvideo-logo.png: https://wiki.archiveteam.org/?title=File%3ABuzzvideo-logo.png
10:30:04		sec^nd quits [Remote host closed the connection]
10:30:06	<h2ibot>	Arkiver uploaded File:Buzzvideo-icon.png: https://wiki.archiveteam.org/?title=File%3ABuzzvideo-icon.png
10:31:57		sec^nd (second) joins
11:20:07		march_happy quits [Remote host closed the connection]
11:22:47		march_happy (march_happy) joins
11:31:27		sec^nd quits [Ping timeout: 248 seconds]
11:32:55		sec^nd (second) joins
12:07:34		le0n quits [Quit: see you later, alligator]
12:09:04		le0n (le0n) joins
12:28:18	<@JAA>	(They left hours ago.)
12:39:42	<schwarzkatz|m>	ah. is that something only admins see?
12:40:29	<@JAA>	I have no idea what Matrix does with that information, but on IRC, anyone can see it.
12:41:21	<schwarzkatz|m>	hm, weird.
12:41:28	<joepie91|m>	I've been noticing that parts aren't bridging correctly lately
12:41:31	<joepie91|m>	I suspect a bridge bug
13:08:45		Arcorann quits [Ping timeout: 268 seconds]
13:13:19		Ketchup901 quits [Ping timeout: 248 seconds]
13:16:02		Ketchup901 (Ketchup901) joins
13:57:43		spirit joins
14:01:12		gazorpazorp quits [Read error: Connection reset by peer]
14:10:00		Ketchup901 quits [Remote host closed the connection]
14:10:42		Ketchup901 (Ketchup901) joins
14:12:21		Ketchup901 quits [Remote host closed the connection]
14:13:22		Ketchup901 (Ketchup901) joins
14:22:39		sec^nd quits [Ping timeout: 248 seconds]
14:23:12		sec^nd (second) joins
14:27:34		VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
14:29:47		VerifiedJ (VerifiedJ) joins
14:47:57	<Frogging101>	Is yt-dlp able to download a YouTube channel that has more videos than the page limit?
15:12:54		sonick quits [Client Quit]
15:29:14	<JTL>	can you provide an example channel?
15:30:34	<Doranwen>	schwarzkatz|m: They were looking for https://www.fanfiction.net/s/1888034/1/. It's not in the FanficRepack_Redux collection, which a friend of mine suggested looking in.
15:32:51		Island joins
15:34:07		march_happy quits [Remote host closed the connection]
15:35:08		march_happy (march_happy) joins
15:37:55		march_happy quits [Remote host closed the connection]
15:38:43		march_happy (march_happy) joins
15:45:14		HP_Archivist (HP_Archivist) joins
16:01:12	<@JAA>	Doranwen: Do we have any idea when it was deleted?
16:01:22	<@JAA>	The WBM snapshot is from 2005.
16:06:39		mut4ntm0nkey quits [Ping timeout: 248 seconds]
16:06:59		mut4ntm0nkey (mutantmonkey) joins
17:24:03		qwertyasdfuiopghjkl joins
18:04:05		DLoader quits [Ping timeout: 265 seconds]
18:12:55		DLoader joins
18:13:59		upintheairsheep joins
18:14:32	<upintheairsheep>	Hello, I would like to learn what tool https://archive.org/details/TikTok?tab=about is scraped by
18:15:07	<@arkiver>	internal, not related to IA
18:15:13	<upintheairsheep>	I know a lot about the comment API and the replies API
18:15:42	<upintheairsheep>	So is the ArchiveTeam not behind it?
18:15:45	<@arkiver>	no
18:16:38		upintheairsheep leaves
18:17:05	<spirit>	NEXT!
18:17:34	<@arkiver>	:P
18:18:06		upintheairsheep joins
18:18:47	<upintheairsheep>	To remind you, TikTok is going to remove videos related to tanning after warning from medical experts. https://www.theguardian.com/technology/2022/dec/01/tiktok-to-ban-videos-that-encourage-sunburn-and-tanning-after-alarm-from-medical-experts
18:22:03	<upintheairsheep>	Tag and videos still seem to be up: https://www.tiktok.com/tag/sunburnchallenge?lang=en
18:22:51	<upintheairsheep>	Other tags of interest: https://www.tiktok.com/tag/sunburn https://www.tiktok.com/tag/tanning https://www.tiktok.com/tag/sunbathing
18:32:53		upintheairsheep quits [Remote host closed the connection]
18:37:38		hackbug (hackbug) joins
19:25:48		systwi_ (systwi) joins
19:27:51		systwi quits [Ping timeout: 276 seconds]
19:34:12	<TheTechRobo>	How do you reverse engineer the requests that a Steam game makes? I was thinking of a proxy, but as far as I'm aware you can't configure its use.
19:34:25	<TheTechRobo>	Wireshark's fine but it captures ALL traffic...
19:35:04	<schwarzkatz|m>	it has powerful filtering tho
19:36:50	<TheTechRobo>	I don't know how to use it xD
19:37:47	<TheTechRobo>	I might be able to guess at the domain name, though. Is there a way to do that for wireshark?
19:37:54	<TheTechRobo>	Or guess at part of the domain name, at least.
19:38:01	<TheTechRobo>	(I know both the company and game name)
19:38:31	<schwarzkatz|m>	related documentation:
19:38:31	<schwarzkatz|m>	https://www.wireshark.org/docs/wsug_html_chunked/ChCapCaptureFilterSection.html
19:38:31	<schwarzkatz|m>	https://www.tcpdump.org/manpages/pcap-filter.7.html
19:38:35	<TheTechRobo>	Wireshark also isn't great for HTTP because it just gets the raw TCP data, no? There's likely ssl.
19:39:23	<schwarzkatz|m>	you'd need to use https://docs.mitmproxy.org/stable/ then I guess :D
19:40:43	<@JAA>	Depending on how the game validates TLS certs, it might be messy though.
19:41:05	<@JAA>	If it has its own cert store or hardcoded fingerprints or similar, for example.
19:41:36	<@JAA>	Then you'll need to either replace that (have fun) or use something like tcpdump/Wireshark and extract the master key (also fun).
19:41:47	<@JAA>	pre-master key*
19:42:01	<schwarzkatz|m>	if everything goes through mitmproxy though, why would it be messy?
19:42:30	<@JAA>	Because the client (game) needs to trust mitmproxy's CA cert for that to work.
19:46:57	<@JAA>	If it uses the system trust store, that's easy, but if it doesn't, mess.
19:47:42	<TheTechRobo>	Is there a linux way to get the traffic from a specific process given its PID?
19:47:45	<@JAA>	See also: you can't make browsers accept mitmproxy by adding the CA cert to the system trust store. Need to do it separately in the browser.
19:48:07	<schwarzkatz|m>	that... sucks. I thought it was system wide.
19:50:04	<@JAA>	TheTechRobo: Maybe some iptables magic would help here, but not sure.
19:53:20	<@JAA>	Stack Exchange suggests strace, network namespaces, and iptables: https://askubuntu.com/questions/11709/how-can-i-capture-network-traffic-of-a-single-process
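A rough sketch related to the per-process question above. It is not one of the approaches from the linked answer (strace, network namespaces, iptables); it swaps in a simpler technique that only lists the remote IPv4 TCP endpoints a process currently has open, by matching the socket inodes under /proc/PID/fd against /proc/PID/net/tcp. The function names are made up, the address decoding assumes a little-endian machine such as x86, and the result is a snapshot of open connections, not a capture. The endpoints it prints could then be used to narrow a Wireshark or tcpdump filter.

```python
import os
import socket


def parse_proc_net_tcp(text):
    """Map socket inode -> (remote_ip, remote_port) from /proc/net/tcp-style text."""
    table = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split()
        if len(parts) < 10:
            continue
        ip_hex, port_hex = parts[2].split(":")  # rem_address column
        # Assumption: little-endian host, so the hex bytes are in reverse order.
        ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
        table[parts[9]] = (ip, int(port_hex, 16))  # parts[9] is the inode
    return table


def socket_inodes(pid):
    """Return the inode numbers of all sockets the process holds open."""
    inodes = set()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were looking
        if target.startswith("socket:[") and target.endswith("]"):
            inodes.add(target[len("socket:["):-1])
    return inodes


def remote_endpoints(pid):
    """List (ip, port) pairs for the process's connected IPv4 TCP sockets."""
    with open(f"/proc/{pid}/net/tcp") as f:
        table = parse_proc_net_tcp(f.read())
    found = {table[i] for i in socket_inodes(pid) if i in table}
    return sorted(ep for ep in found if ep[1] != 0)  # drop listening sockets
```

Calling `remote_endpoints(pid)` usually needs to run as the same user as the target process, or as root, to be allowed to read its descriptor list.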
20:21:46		sudofox joins
20:22:59	<sudofox>	hiya. i'm looking for some tool recommendations. so i've been trying to archive all static assets from some websites i'm interested in for personal curiosity. i decided to finally give archiving user content from one of them a shot, but it kinda breaks my normal workflow of "try many URLs and git commit whatever i found" due to the sheer # of files
20:23:34	<sudofox>	i've started using git lfs but the reason i'm using git is mainly to actually see how much progress i've made/new things found each time i try something
20:24:21	<sudofox>	i'm wondering if there's a better tool to track progress with recovered files -- i'm also committing tooling for guessing filenames at the same time
20:24:49	<sudofox>	i guess i could use S3 but I still like being able to see what's new with `git status` and so on.
20:25:40	<sudofox>	also git lfs kinda duplicates objects into .git/lfs so double disk space
20:32:56	<@JAA>	Yeah, you'll want to get away from 'one file per asset' anyway probably. It just doesn't scale. Eventually, your file system will be sad as well.
20:34:03	<sudofox>	eh, yeah, you're right -- key-based object storage is probably much better for this stuff
20:34:04	<@JAA>	One route is WARC, but accessibility isn't exactly great with it.
20:34:45	<sudofox>	i've been thinking about building a little ceph server in my basement for a while for that purpose (instead of using Amazon)
20:35:12	<@JAA>	You get extra metadata and a technically more accurate capture that way, too.
20:35:30	<@JAA>	I suppose that would work as well, yeah.
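A small sketch of the key-based storage idea from this exchange, using SQLite from the Python standard library instead of one file per asset. The table layout, the function names, and the integer `run` counter are all assumptions made up for illustration. Each attempt is recorded as a run number, so "what did this attempt find?" can be answered with a query, roughly the role `git status` played in the workflow described above.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the asset store; ':memory:' keeps the demo self-contained."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        "url TEXT PRIMARY KEY, sha256 TEXT NOT NULL, "
        "body BLOB NOT NULL, run INTEGER NOT NULL)"
    )
    return db


def put(db, run, url, body):
    """Store body under url. Return True if the URL is new or its content changed."""
    digest = hashlib.sha256(body).hexdigest()
    row = db.execute("SELECT sha256 FROM assets WHERE url = ?", (url,)).fetchone()
    if row is not None and row[0] == digest:
        return False  # already have this exact content
    db.execute(
        "INSERT OR REPLACE INTO assets (url, sha256, body, run) VALUES (?, ?, ?, ?)",
        (url, digest, body, run),
    )
    db.commit()
    return True


def new_in_run(db, run):
    """List the URLs that were first stored, or changed, during the given run."""
    rows = db.execute("SELECT url FROM assets WHERE run = ? ORDER BY url", (run,))
    return [r[0] for r in rows]


db = open_store()
put(db, 1, "https://example.org/img/a.png", b"first asset")
put(db, 2, "https://example.org/img/a.png", b"first asset")   # unchanged, ignored
put(db, 2, "https://example.org/img/b.png", b"second asset")
print(new_in_run(db, 2))
```

This suits many small static assets; very large files would likely be better kept outside the database with only their hashes stored.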
21:22:05		systwi (systwi) joins
21:22:54		systwi_ quits [Ping timeout: 276 seconds]
22:07:11		sonick (sonick) joins
22:29:57	<Doranwen>	JAA: No, he never mentioned that. Left his Reddit nick with me but that's all I've got. Oh well, lol.
23:02:39		hitgrr8 quits [Client Quit]
23:06:39		sudofox quits [Ping timeout: 265 seconds]
23:26:47		fishingforsoup_ quits [Quit: Leaving]
23:27:04		fishingforsoup joins
23:40:42		Arcorann (Arcorann) joins
23:48:24		jacksonchen666 (jacksonchen666) joins
23:56:23		sudofox joins
