#archiveteam-bs log for 2022-09-04

00:08:55		fishingforpie quits [Remote host closed the connection]
00:09:19		fishingforpie joins
00:17:42		sec^nd quits [Remote host closed the connection]
00:22:54		sec^nd (second) joins
00:24:51		pcr (pcr) joins
00:35:46		sec^nd quits [Ping timeout: 240 seconds]
00:42:26		sec^nd (second) joins
00:45:29		Arcorann (Arcorann) joins
00:51:58		fishingforpie quits [Remote host closed the connection]
01:01:23		fishingforpie joins
01:12:19		fishingforpie quits [Remote host closed the connection]
01:43:13	<systwi_>	https://kiwifarms.ru/ might be a mirror of https://kiwifarms.net/ , but I can't access it because of "DDOS-GUARD."
01:43:35	<systwi_>	I'm assuming AB will have the same problem.
01:53:10	<systwi_>	Does anyone have any suggestions on scraping a broken, but loaded, Firefox page for hrefs?
01:54:03		katocala quits [Remote host closed the connection]
01:54:12	<systwi_>	I can neither pull up developer tools, view the page source, use Link Gopher to scrape for links, "print" it to a PDF nor save it to an HTML/_files.
01:54:45	<systwi_>	The last resort I'm thinking of is OCR, which would _not_ be fun. This page is massive.
01:56:11	<systwi_>	Or, alternatively, is there a way to get a full list of a Tumblr post's "notes?"
01:56:56	<@JAA>	Probably with a bit of scripting around the pagination.
01:58:51	<systwi_>	I tried looking into that and found a few questions on Stack Exchange which didn't seem to help. I'm guessing Tumblr changed how pagination works on /notes/ pages (both w/ and w/o JS) since then.
01:59:22	<systwi_>	Or _maybe_ it was some strange cache thing in Firefox? I didn't verify that.
01:59:43	<systwi_>	Either way, the "Show more notes" link would always be the full /notes/ URL with # suffixed.
02:01:07	<systwi_>	URL I was trying, for reference: https://sierns.tumblr.com/notes/661611184805527552/kX3iH4VUO
02:01:28	<@JAA>	The pagination URLs are in some JS mess.
02:01:36	<@JAA>	But should be easy enough.
02:02:17	<systwi_>	Oh yeah, there was this portion of the URL buried in some JS: ?from_c=1656280336
02:02:27	<systwi_>	but the number didn't appear exploitable.
02:04:09	<@JAA>	It's a timestamp IIRC, but yeah, basically you need to extract it to work properly.
02:05:32		katocala joins
02:06:02		katocala is now authenticated as katocala
02:07:09	<@JAA>	Second, writing an ugly Python one-liner. :-)
02:07:32	<systwi_>	I tried that last time but it still suffixed the #. Now, looking in the same spot as last time, I see the next URL, ending with ?from_c=1654705021.
02:08:30	<systwi_>	(or doing it the less-cool way, trying to have a bash script do the same :-P )
02:08:45	<@JAA>	Yeah, but then you don't get connection reuse, so it's slow.
02:09:01	<@JAA>	Or well, slower than it needs to be.
02:09:54	<systwi_>	I feel this is going to get quite involved, considering my current Python skill set. x_x
02:13:24	<@JAA>	`python3 -c 'import itertools, re, requests, sys, urllib.parse; s = requests.Session(); url = sys.argv[1]; {print(f"Fetching {url}", file = sys.stderr) or print((r := s.get(url)).text) or ((m := re.search(r"\x27/notes/\d+/[A-Za-z0-9]+\?from_c=\d+(?=\x27)", r.text)) and (url := urllib.parse.urljoin(url, m.group(0)[1:]))) or 1/0 for _ in itertools.count()}'
02:13:29	<@JAA>	https://sierns.tumblr.com/notes/661611184805527552/kX3iH4VUO`
02:13:42	<@JAA>	Blows up with a ZeroDivisionError because I'm too lazy to terminate that properly now.
02:14:07	<@JAA>	Output of all notes pages goes to stdout, progress/URLs being fetched go to stderr.
02:14:16		systwi_ closes freshly-started bash script progress... :-P
02:14:22	<@JAA>	Requires Python 3.8+
02:15:01	<@JAA>	Actually hang on
02:15:32	<systwi_>	Thank you! This will be extremely useful. My current way of getting the next notes page was...you'll probably laugh...keeping my End key held down and running an auto clicker, with the cursor positioned on "Show more notes."
02:16:05	<@JAA>	`python3 -c 'import itertools, re, requests, sys, urllib.parse; s = requests.Session(); url = sys.argv[1]; {print(f"Fetching {url}", file = sys.stderr) or print((r := s.get(url)).text) or ((m := re.search(r"\x27/notes/\d+/[A-Za-z0-9]+\?from_c=\d+(?=\x27)", r.text)) and (url := urllib.parse.urljoin(url, m.group(0)[1:])) and True) or 1/0 for _ in itertools.count()}' URL`
02:16:10	<systwi_>	It worked great...until the page froze.
02:16:29	<@JAA>	Yeah, that's usually how it goes. It works well until it doesn't. :-)
02:16:34	<systwi_>	:-)
02:16:45	<@JAA>	This one should be constant memory, unlike the first version.
02:16:54	<systwi_>	Thanks. will try this. Much appreciated.
02:16:56	<@JAA>	Can be started from the post page, too.
02:17:03	<systwi_>	Ooh perfect.
02:18:12		systwi_ blindly runs code from a stranger
02:18:15	<systwi_>	:-P
02:18:49	<systwi_>	Joking, I trust it.
02:19:39		BlueMaxima quits [Client Quit]
02:20:34	<@JAA>	It's pretty easy to follow once you figure out the structure. Just a very hacky way to get it into a true one-liner. :-)
02:21:10	<@JAA>	It's basically `while url: r = requests.get(url); print(r.text); url = extract_next_page(r.text)`.
02:21:30	<systwi_>	Yeah, seemed pretty readable to me, even with my rudimentary Python skills.
02:21:47	<systwi_>	It's done. :-o
02:21:53	<@JAA>	Just written as a set comprehension because I figured out recently how to abuse those to get rid of awful '...'$'\n''...' constructs.
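
For readers following along, here is an expanded sketch of the loop JAA describes above. It is not JAA's code, only a readable rendering of the same idea: fetch a page, print it, pull the next `?from_c=` URL out of the page's JS, repeat. The regex is taken from the one-liner; the names `extract_next_page`, `iter_notes_pages` and `fetch` are illustrative, and it assumes Tumblr still embeds the next-page URL as a single-quoted relative path.

```python
import re
import sys
import urllib.parse

# Same pattern as the one-liner: a single-quoted relative /notes/ URL
# carrying a from_c parameter. The capture group drops the quotes.
NEXT_PAGE_RE = re.compile(r"'(/notes/\d+/[A-Za-z0-9]+\?from_c=\d+)'")


def extract_next_page(base_url, html):
    """Return the absolute URL of the next notes page, or None if absent."""
    m = NEXT_PAGE_RE.search(html)
    return urllib.parse.urljoin(base_url, m.group(1)) if m else None


def iter_notes_pages(url, fetch):
    """Yield (url, html) for each page until no next-page link is found.

    `fetch` is any callable mapping a URL to its body text (hypothetical
    parameter, so the loop can be exercised without network access).
    """
    while url:
        html = fetch(url)
        yield url, html
        url = extract_next_page(url, html)


# Only run the network part when a URL is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    import requests  # third-party; imported late so the helpers work without it

    session = requests.Session()  # one session, so connections are reused

    def fetch(u):
        print(f"Fetching {u}", file=sys.stderr)  # progress goes to stderr
        return session.get(u).text

    for _, page in iter_notes_pages(sys.argv[1], fetch):
        print(page)  # page bodies go to stdout
```

Unlike the one-liner, this stops cleanly when no next-page link is found instead of raising ZeroDivisionError. Saved under a name of your choice (say `notes.py`), it would be run as `python3 notes.py URL > notes.html`.
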
02:37:12		Chris5010 quits [Remote host closed the connection]
02:48:57	<jamesp>	Looks like kiwifarms is bach at kiwifarms.ru
02:49:00	<jamesp>	*back
02:49:12	<jamesp>	I was also told about a tor link
02:51:46	<@JAA>	Yeah, discussed earlier here and in #archivebot, the .ru domain uses DDoS-GUARD, and .onion is .onion.
03:01:01		Kinille quits []
03:14:34		Arcorann quits [Ping timeout: 265 seconds]
03:26:52	<Ryz>	Mono = 1, and rail = rail
03:37:27		Arcorann (Arcorann) joins
03:38:03		Kinille (Kinille) joins
04:12:53	<h2ibot>	John123521 edited Yuku.com (+1): https://wiki.archiveteam.org/?diff=48911&oldid=48850
04:12:55	<@JAA>	^ RIP
04:13:52	<h2ibot>	Pokechu22 edited ISP Hosting (+317, +coqui.net): https://wiki.archiveteam.org/?diff=48912&oldid=47729
04:13:53	<h2ibot>	JustAnotherArchivist changed the user rights of User:Pokechu22
04:14:53	<h2ibot>	JustAnotherArchivist changed the user rights of User:Nemo bis
04:16:26		march_happy quits [Ping timeout: 265 seconds]
04:17:00		march_happy (march_happy) joins
04:30:20		sec^nd quits [Remote host closed the connection]
04:37:25		sec^nd (second) joins
04:38:17		tbc1887 (tbc1887) joins
04:43:46		sec^nd quits [Ping timeout: 240 seconds]
04:50:29		sec^nd (second) joins
05:43:44		Barto quits [Read error: Connection reset by peer]
05:52:50		Barto (Barto) joins
06:18:43		march_happy quits [Ping timeout: 265 seconds]
06:18:58		march_happy (march_happy) joins
06:36:46		sec^nd quits [Ping timeout: 240 seconds]
06:43:24		sec^nd (second) joins
07:17:21		knecht420 quits [Quit: Ping timeout (120 seconds)]
07:17:27		knecht420 (knecht420) joins
07:22:16		sec^nd quits [Ping timeout: 240 seconds]
07:33:33		sec^nd (second) joins
08:15:30		tbc1887 quits [Read error: Connection reset by peer]
09:53:15		h joins
09:53:32		h quits [Remote host closed the connection]
10:21:32		tech_exorcist (tech_exorcist) joins
10:31:16		sec^nd quits [Ping timeout: 240 seconds]
10:38:06		sec^nd (second) joins
11:15:08		march_happy quits [Remote host closed the connection]
11:20:24		march_happy (march_happy) joins
11:32:29		march_happy quits [Remote host closed the connection]
11:35:28		march_happy (march_happy) joins
11:36:21		march_happy quits [Remote host closed the connection]
11:38:41		march_happy (march_happy) joins
11:45:43		march_happy quits [Remote host closed the connection]
11:48:17		march_happy (march_happy) joins
11:53:11		march_happy quits [Ping timeout: 265 seconds]
11:57:09		march_happy (march_happy) joins
12:03:10		fishingforpie joins
12:37:22		Iki joins
13:18:52		tech_exorcist quits [Client Quit]
13:40:26		fishingforpie quits [Remote host closed the connection]
13:51:13		michaelblob quits [Read error: Connection reset by peer]
13:52:35		dm4v_ joins
13:52:37		michaelblob (michaelblob) joins
13:54:59		dm4v quits [Ping timeout: 265 seconds]
13:54:59		dm4v_ is now known as dm4v
14:09:40		Arcorann quits [Ping timeout: 240 seconds]
14:44:30		flashfire42 quits [Quit: The Lounge - https://thelounge.chat]
14:44:31		kiska quits [Quit: The Lounge - https://thelounge.chat]
14:44:31		Ryz2 quits [Quit: The Lounge - https://thelounge.chat]
14:44:31		s-crypt quits [Quit: The Lounge - https://thelounge.chat]
14:47:43		Ryz2 (Ryz) joins
14:47:45		s-crypt (s-crypt) joins
14:47:51		flashfire42 (flashfire42) joins
14:48:52		kiska (kiska) joins
15:58:33		fishingforpie joins
16:00:43		tech_exorcist (tech_exorcist) joins
16:56:46		fishingforpie quits [Remote host closed the connection]
18:24:28		le0n quits [Ping timeout: 240 seconds]
18:27:09		le0n (le0n) joins
19:59:53	<h2ibot>	JustAnotherArchivist created TJournal (+16, Redirected page to [[TJ]]): https://wiki.archiveteam.org/?title=TJournal
20:44:16		sec^nd quits [Ping timeout: 240 seconds]
20:54:36		sec^nd (second) joins
21:20:34		datechnoman quits [Quit: The Lounge - https://thelounge.chat]
22:13:42		igloo22225 quits [Client Quit]
22:13:56		igloo22225 (igloo22225) joins
22:26:16		tech_exorcist quits [Client Quit]
22:43:23	<h2ibot>	Arkiver uploaded File:Tjournal-logo.png: https://wiki.archiveteam.org/?title=File%3ATjournal-logo.png
22:58:21		BlueMaxima joins
23:13:14		wickedplayer494 quits [Ping timeout: 265 seconds]
23:13:27	<h2ibot>	Switchnode edited TJ (+189, update project info/status): https://wiki.archiveteam.org/?diff=48915&oldid=48822
23:15:16		wickedplayer494 joins
23:17:27	<h2ibot>	Switchnode edited Template:Infobox project (+4, reflect hackint as default irc network): https://wiki.archiveteam.org/?diff=48916&oldid=48856
23:20:20		wickedplayer494 is now authenticated as wickedplayer494
23:26:55		Arcorann (Arcorann) joins
