| 00:15:09 | | megapro17 joins |
| 00:16:41 | <megapro17> | hi everyone. i want to ask a question, what is the best way to archive a twitter page? i want a human readable copy with pictures. as i found snscrape can only scrape to json format, without pictures. is there any more friendly solution? |
| 00:19:04 | <megapro17> | shouldn't nitter be great for this? |
| 00:24:30 | <TheTechRobo> | megapro17: For what it's worth, snscrape does provide image metadata in the JSON. |
| 00:24:42 | <TheTechRobo> | You can download the pictures by parsing the URL out of the JSON. |
| 00:25:00 | <megapro17> | yes, but meeh you need to download them, store them somehow |
| 00:25:31 | <@JAA> | Data tends to have that issue, yeah. :-) |
| 00:25:50 | <megapro17> | well maybe someone invented a wheel |
| 00:28:24 | <megapro17> | parse with snscrape json as usual, then regex twimg and run wget on them |
| 00:38:42 | | megapro17 quits [Remote host closed the connection] |
| 00:38:50 | | megapro17 joins |
| 00:55:24 | | RisenRubix_ quits [Read error: Connection reset by peer] |
| 00:55:44 | | RisenRubix_ joins |
| 01:04:04 | | omglolbah quits [Ping timeout: 240 seconds] |
| 01:05:07 | | omglolbah joins |
| 01:12:34 | | megapro17 quits [Remote host closed the connection] |
| 02:12:12 | | michaelblob quits [Client Quit] |
| 02:27:13 | | omglolbah quits [Read error: Connection reset by peer] |
| 02:27:44 | | omglolbah joins |
| 02:45:40 | | omglolbah quits [Ping timeout: 240 seconds] |
| 02:50:47 | | omglolbah joins |
| 02:53:16 | | lennier1 quits [Ping timeout: 240 seconds] |
| 02:54:18 | | lennier1 (lennier1) joins |
| 03:00:28 | | tzt quits [Ping timeout: 265 seconds] |
| 03:01:43 | | tzt (tzt) joins |
| 03:26:31 | | katocala joins |
| 03:27:11 | | katocala is now authenticated as katocala |
| 03:36:04 | | omglolbah quits [Ping timeout: 240 seconds] |
| 03:37:12 | | omglolbah joins |
| 03:53:19 | | AnotherIki joins |
| 03:56:52 | | Iki1 quits [Ping timeout: 240 seconds] |
| 04:04:16 | | ThreeHM quits [Ping timeout: 265 seconds] |
| 04:04:38 | | ThreeHM (ThreeHeadedMonkey) joins |
| 04:38:49 | <lennier1> | The Verge is reporting that Elon Musk discussed putting Twitter behind a paywall. Obviously Musk discusses a lot of stuff, but for sure that business seems like a mess right now. https://www.theverge.com/2022/11/7/23446262/elon-musk-twitter-paywall-possible |
| 04:44:52 | | geezabiscuit quits [Ping timeout: 240 seconds] |
| 04:45:07 | | geezabiscuit (geezabiscuit) joins |
| 05:00:00 | | treora quits [Quit: blub blub.] |
| 05:01:24 | | treora joins |
| 05:03:55 | | Iki1 joins |
| 05:04:20 | | RisenRubix__ joins |
| 05:04:20 | | RisenRubix_ quits [Remote host closed the connection] |
| 05:04:20 | | AnotherIki quits [Remote host closed the connection] |
| 05:09:46 | | RisenRubix_ joins |
| 05:10:31 | | RisenRubix__ quits [Remote host closed the connection] |
| 05:19:41 | | omglolbah_ joins |
| 05:22:19 | | omglolbah quits [Ping timeout: 265 seconds] |
| 05:22:19 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 05:22:19 | | Barto quits [Ping timeout: 265 seconds] |
| 05:22:19 | | programmerq quits [Ping timeout: 265 seconds] |
| 05:22:19 | | Somebody2 quits [Ping timeout: 265 seconds] |
| 05:22:19 | | kpcyrd quits [Ping timeout: 265 seconds] |
| 05:22:19 | | Jonimus quits [Ping timeout: 265 seconds] |
| 05:22:34 | | kpcyrd (kpcyrd) joins |
| 05:22:34 | | programmerq (programmerq) joins |
| 05:22:39 | | Barto (Barto) joins |
| 05:22:41 | | Jonimus joins |
| 05:22:42 | | Somebody2 joins |
| 05:26:52 | | tzt quits [Ping timeout: 240 seconds] |
| 05:28:06 | | tzt (tzt) joins |
| 05:49:07 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 05:53:25 | | balrog quits [Quit: Bye] |
| 05:58:11 | | h2ibot quits [Remote host closed the connection] |
| 05:58:24 | | h2ibot (h2ibot) joins |
| 05:59:40 | | qwertyasdfuiopghjkl joins |
| 06:02:04 | | balrog (balrog) joins |
| 06:08:24 | | Arcorann (Arcorann) joins |
| 06:26:50 | | JackThompson quits [Ping timeout: 268 seconds] |
| 07:05:12 | | Czechball joins |
| 07:11:12 | <SketchCow> | Someone please mirror: https://www.youtube.com/watch?v=Wn_WPK-xFoQ |
| 07:18:38 | <SketchCow> | Wikiteam, please mirror http://en.techinfodepot.shoutwiki.com/wiki/Main_Page |
| 07:50:58 | | sknebel quits [Remote host closed the connection] |
| 07:52:28 | | @AlsoJAA quits [Ping timeout: 240 seconds] |
| 07:52:42 | | sknebel (sknebel) joins |
| 07:53:19 | | AlsoJAA (JAA) joins |
| 07:53:19 | | @ChanServ sets mode: +o AlsoJAA |
| 08:13:13 | | sonick quits [Client Quit] |
| 08:24:29 | | Adrmcr (Adrmcr) joins |
| 08:25:45 | | Adrmcr quits [Remote host closed the connection] |
| 08:25:59 | | Adrmcr (Adrmcr) joins |
| 08:32:20 | | JackThompson joins |
| 08:32:37 | <Adrmcr> | Posted a description of this in #down-the-tube already, but one of my youtube accounts has strangely gotten access to view comments on every art track channel that isn't a "- Topic" again, like C418's minecraft volume beta and Lena Raine's celeste farewell music. |
| 08:33:16 | <Adrmcr> | Posting comments on those videos doesn't work, though. |
| 08:40:54 | | JackThompson4 joins |
| 08:41:16 | | JackThompson quits [Ping timeout: 268 seconds] |
| 08:41:16 | | JackThompson4 is now known as JackThompson |
| 09:09:53 | <JTL> | Adrmcr: If it's what I think it is, I can see the same thing without being signed in, but in my main browser with all the Google cookies. If I try an incognito window, I get "Comments are turned off" |
| 09:09:53 | | RisenRubix_ quits [Read error: Connection reset by peer] |
| 09:09:57 | <JTL> | what the heck google :P |
| 09:10:10 | <JTL> | exact same video |
| 09:10:14 | | RisenRubix_ joins |
| 09:22:00 | | Sluggs joins |
| 10:21:05 | | Adrmcr quits [Remote host closed the connection] |
| 10:38:18 | | RisenRubix__ joins |
| 10:39:26 | | Czechball quits [Client Quit] |
| 10:39:26 | | RisenRubix_ quits [Remote host closed the connection] |
| 10:39:30 | | Czechball joins |
| 10:52:50 | | sec^nd quits [Remote host closed the connection] |
| 10:53:10 | | sec^nd (second) joins |
| 11:35:55 | | Megame (Megame) joins |
| 11:51:12 | | dm4v quits [Ping timeout: 268 seconds] |
| 11:54:09 | | Megame quits [Client Quit] |
| 12:03:57 | | dm4v joins |
| 12:04:48 | | Straif quits [Quit: Connection closed for inactivity] |
| 12:07:50 | | sonick (sonick) joins |
| 12:42:28 | | Arcorann quits [Ping timeout: 240 seconds] |
| 12:44:33 | | Megame (Megame) joins |
| 12:57:54 | | programmerq quits [Remote host closed the connection] |
| 13:09:54 | | Iki joins |
| 13:10:12 | | borislav joins |
| 13:12:28 | | Iki1 quits [Ping timeout: 240 seconds] |
| 13:15:42 | | Megame quits [Client Quit] |
| 13:30:32 | | Czechball8 joins |
| 13:30:34 | | Czechball quits [Client Quit] |
| 13:30:34 | | borislav quits [Remote host closed the connection] |
| 13:30:34 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 13:30:34 | | Czechball8 is now known as Czechball |
| 13:43:11 | | Adrmcr (Adrmcr) joins |
| 13:48:28 | | qwertyasdfuiopghjkl joins |
| 13:57:14 | | programmerq (programmerq) joins |
| 14:07:41 | | tech_exorcist (tech_exorcist) joins |
| 16:10:38 | | HP_Archivist (HP_Archivist) joins |
| 16:34:45 | | lennier1 quits [Client Quit] |
| 16:36:44 | | lennier1 (lennier1) joins |
| 16:45:08 | | Adrmcr quits [Remote host closed the connection] |
| 16:46:00 | | march_happy quits [Ping timeout: 265 seconds] |
| 16:46:14 | | march_happy (march_happy) joins |
| 17:02:02 | | HP_Archivist quits [Client Quit] |
| 18:03:09 | | michaelblob (michaelblob) joins |
| 18:09:23 | | Lord_Nightmare quits [Quit: ZNC - http://znc.in] |
| 18:14:15 | | Lord_Nightmare (Lord_Nightmare) joins |
| 18:20:07 | <IDK> | am I the only one to say that wayback machine is getting really slow right now |
| 18:20:22 | <IDK> | sometimes does not respond at all for a few minutes |
| 18:21:14 | <@JAA> | #internetarchive |
| 18:45:43 | | sknebel quits [Client Quit] |
| 18:45:43 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 18:45:43 | | Doranwen quits [Remote host closed the connection] |
| 18:45:54 | | Doranwen (Doranwen) joins |
| 18:46:52 | | sknebel (sknebel) joins |
| 18:47:51 | | qwertyasdfuiopghjkl joins |
| 18:50:59 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 18:51:14 | | qwertyasdfuiopghjkl joins |
| 18:53:45 | | Chris5010 quits [Quit: ] |
| 19:00:23 | | mut4ntmonkey quits [Remote host closed the connection] |
| 19:01:14 | | mut4ntmonkey (mutantmonkey) joins |
| 19:08:06 | | mut4ntmonkey quits [Remote host closed the connection] |
| 19:08:36 | | mut4ntmonkey (mutantmonkey) joins |
| 19:13:56 | | Chris5010 (Chris5010) joins |
| 19:16:04 | | Chris5010 quits [Client Quit] |
| 19:35:40 | | tzt quits [Remote host closed the connection] |
| 19:36:01 | | tzt (tzt) joins |
| 19:42:16 | <@JAA> | That Twitter scraping is done, 20.9 million tweets. |
| 19:46:06 | | mwfc (mwfc) joins |
| 19:51:57 | <mwfc> | Hej, I suspect a grab of Twitter is in the making or already running? |
| 20:03:26 | | Chris5010 (Chris5010) joins |
| 20:29:48 | | Czechball quits [Client Quit] |
| 20:30:36 | | Czechball joins |
| 21:06:02 | | Chris5010 quits [Ping timeout: 265 seconds] |
| 21:07:31 | | DopefishJustin quits [Remote host closed the connection] |
| 21:10:49 | | DopefishJustin joins |
| 21:10:49 | | DopefishJustin is now authenticated as DopefishJustin |
| 21:11:27 | | tech_exorcist quits [Client Quit] |
| 21:16:13 | <betamax> | JAA: should I be doing anything with the campaign sites? I can't see any docs for #Y, so I'm inclined to just try and do a grab using my residential connection... |
| 21:16:31 | <betamax> | (but if there's a way using #Y that the end results could end up in wayback, that would be much better) |
| 21:26:11 | <@JAA> | betamax: Yeah, there are no docs, and setting projects up is a manual thing currently that only arkiver can do. You might want to run something yourself. If it's WARC, we can always get it into the WBM later. |
| 21:26:52 | <@JAA> | Please use either an old wget version or wget-at though. Current upstream wget produces weird WARCs. |
| 21:29:08 | <@JAA> | I don't remember which wget version first had the bug, but this is the starting point if you want to dig around: https://github.com/webrecorder/warcio/pull/42 |
| 21:31:22 | | borislav joins |
| 21:33:14 | <betamax> | ah, thanks for the heads up (I would have just used the latest version) |
| 21:36:46 | | igloo22225 quits [Quit: Ping timeout (120 seconds)] |
| 21:37:00 | | igloo22225 (igloo22225) joins |
| 22:15:09 | | mut4ntmonkey quits [Remote host closed the connection] |
| 22:16:22 | | mut4ntmonkey (mutantmonkey) joins |
| 22:18:09 | | mut4ntm0nkey (mutantmonkey) joins |
| 22:18:21 | | mut4ntmonkey quits [Remote host closed the connection] |
| 22:20:53 | | mut4ntm0nkey quits [Remote host closed the connection] |
| 22:21:21 | | mut4ntm0nkey (mutantmonkey) joins |
| 22:32:07 | | RisenRubix__ quits [Read error: Connection reset by peer] |
| 22:34:10 | | katocala quits [Remote host closed the connection] |
| 22:39:25 | | BlueMaxima joins |
| 23:07:12 | | Ketchup901 quits [Ping timeout: 255 seconds] |
| 23:11:12 | | Ketchup901 (Ketchup901) joins |
| 23:12:58 | <betamax> | The command I plan to run is the following: |
| 23:13:01 | <betamax> | wget --mirror --timeout=5 --tries=1 --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36" --page-requisites --warc-file=01 --delete-after -o 1.log http://aaronforrep.com/ |
| 23:13:13 | <betamax> | it's not doing 100% what I want |
| 23:14:12 | <betamax> | (the images, which are on an external domain, are not saved, but adding --span-hosts made it start crawling other sites, which will take too long) |
| 23:14:24 | <betamax> | I'll start running it tomorrow (ran out of time now) |
| 23:14:28 | <@JAA> | Yeah, wget doesn't have --span-hosts-allow like wpull does. |
| 23:14:59 | <@JAA> | I don't think it's possible to have it recurse to off-site page requisites but not off-site links, but not entirely certain. |
| 23:15:18 | <betamax> | it's not major, the HTML/content is the main thing |
| 23:17:39 | <betamax> | JAA: could I just switch to wpull? |
| 23:18:04 | <@JAA> | betamax: Depends on whether one of your kinks is masochism. |
| 23:18:41 | <@JAA> | But you could give it a try with wpull 1.2.3. 2.0.x is basically unusable standalone. |
| 23:18:51 | <betamax> | ah, hmm, maybe not |
| 23:19:09 | <betamax> | I need to get this running ASAP :) |
| 23:27:27 | | DLoader quits [Ping timeout: 255 seconds] |
| 23:32:56 | | DLoader joins |
| 23:33:38 | | Ketchup901 quits [Remote host closed the connection] |
| 23:34:27 | <jodizzle> | Any reason to not use grab-site for this task? |
| 23:34:49 | | Ketchup901 (Ketchup901) joins |
| 23:37:04 | | Atom-- quits [Read error: Connection reset by peer] |
| 23:48:54 | <@JAA> | Actually, yeah, probably the best option. You can just start N crawls, then `touch stop` in each crawl's directory after a few minutes and wait for the processes to exit, then start the next batch. |
| 23:54:49 | | Justin[home] joins |
| 23:54:49 | | Justin[home] is now authenticated as DopefishJustin |
| 23:54:56 | | programmerq quits [Client Quit] |
| 23:55:05 | | BlueMaxima quits [Remote host closed the connection] |
| 23:55:05 | | DopefishJustin quits [Remote host closed the connection] |
| 23:55:05 | | borislav quits [Remote host closed the connection] |
| 23:55:09 | | BlueMaxima joins |
| 23:58:09 | | BlueMaxima quits [Remote host closed the connection] |
| 23:58:12 | | BlueMaxima joins |
| 23:58:48 | | DLoader_ joins |
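
Editor's note on the 00:28:24 suggestion ("parse with snscrape json as usual, then regex twimg and run wget on them"): the following is a minimal sketch of that idea, not a tested archiving tool. It assumes the snscrape output is line-delimited JSON and that image URLs appear as plain `https://pbs.twimg.com/media/...` strings somewhere in each line; the function name `extract_twimg_urls` is hypothetical, and no particular JSON field names are relied upon.

```python
import re

# Assumption: photo URLs are hosted under pbs.twimg.com/media/ and appear
# verbatim inside JSON string values, so they end at a quote, whitespace,
# or backslash.
TWIMG_RE = re.compile(r'https://pbs\.twimg\.com/media/[^"\s\\]+')


def extract_twimg_urls(jsonl_text):
    """Return unique twimg media URLs in order of first appearance."""
    seen = set()
    urls = []
    for line in jsonl_text.splitlines():
        for url in TWIMG_RE.findall(line):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Writing the returned URLs to a file, one per line, would let `wget -i` fetch them in a single run; keeping the original JSON next to the downloaded files preserves the link between each tweet and its pictures.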