| 00:01:21 | | ramparts quits [Remote host closed the connection] |
| 00:03:04 | | ramparts joins |
| 00:06:38 | <@JAA> | jrwr: There's some weirdness happening with HTTP-to-HTTPS redirects on the wiki domain. From #archivebot: 00:04:12 <+pokechu22> Oh, that's interesting: http://archiveteam.org/images/e/e6/Archiveteam.jpg -> https://archiveteam.org/images/e/e6/Archiveteam.jpg -> http://wiki.archiveteam.org/images/e/e6/Archiveteam.jpg -> https://wiki.archiveteam.org/e/e6/Archiveteam.jpg |
| 00:06:46 | <@JAA> | The last one is missing 'images'. |
| 01:01:50 | | jacobk quits [Ping timeout: 265 seconds] |
| 01:25:33 | | Justin[home] joins |
| 01:25:33 | | Justin[home] is now authenticated as DopefishJustin |
| 01:27:33 | | DopefishJustin quits [Ping timeout: 268 seconds] |
| 01:41:22 | | BlueMaxima quits [Client Quit] |
| 01:42:54 | | pabs quits [Quit: Don't rest until all the world is paved in moss and greenery.] |
| 01:44:57 | | pabs (pabs) joins |
| 02:09:17 | | tzt quits [Remote host closed the connection] |
| 02:09:41 | | tzt (tzt) joins |
| 02:40:10 | <@jrwr> | Interesting, I'll look into it |
| 02:43:49 | | qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds] |
| 03:04:28 | | Atom quits [Ping timeout: 240 seconds] |
| 03:05:15 | | Atom joins |
| 03:18:13 | <monika> | tiktok is a nightmare to scrape at scale |
| 03:21:12 | | ramparts quits [Remote host closed the connection] |
| 03:21:53 | | michaelblob (michaelblob) joins |
| 03:24:59 | | Justin[home] quits [Remote host closed the connection] |
| 03:24:59 | | katocala quits [Remote host closed the connection] |
| 03:25:08 | | DopefishJustin joins |
| 03:25:09 | | DopefishJustin is now authenticated as DopefishJustin |
| 03:25:09 | | katocala joins |
| 03:25:16 | | michaelblob_ quits [Ping timeout: 240 seconds] |
| 03:25:51 | | katocala is now authenticated as katocala |
| 03:59:05 | <@arkiver> | please don't just start dumping tons of items of tiktok videos on IA |
| 04:01:38 | | march_happy quits [Ping timeout: 265 seconds] |
| 04:01:53 | | march_happy (march_happy) joins |
| 04:54:17 | | sonick quits [Client Quit] |
| 05:03:50 | | Jake8 (Jake) joins |
| 05:05:51 | | Jake quits [Ping timeout: 255 seconds] |
| 05:05:51 | | Jake8 is now known as Jake |
| 05:08:35 | | eroc19908 (eroc1990) joins |
| 05:10:10 | | eroc1990 quits [Ping timeout: 268 seconds] |
| 05:34:17 | | superkuh_ joins |
| 05:34:27 | | atphoenix_ (atphoenix) joins |
| 05:36:28 | | superkuh quits [Ping timeout: 240 seconds] |
| 05:36:52 | | atphoenix quits [Ping timeout: 240 seconds] |
| 05:49:25 | | Overlordz quits [Quit: Leaving] |
| 06:16:52 | | RisenRubix__ quits [Ping timeout: 240 seconds] |
| 06:23:11 | | michaelblob_ (michaelblob) joins |
| 06:27:15 | | michaelblob quits [Ping timeout: 268 seconds] |
| 06:41:02 | | RisenRubix joins |
| 06:51:07 | | superkuh_ quits [Remote host closed the connection] |
| 06:51:08 | | katocala quits [Remote host closed the connection] |
| 06:51:09 | | superkuh_ joins |
| 06:51:34 | | katocala joins |
| 06:54:08 | | superkuh__ joins |
| 06:55:02 | | superkuh_ quits [Remote host closed the connection] |
| 07:18:39 | | qwertyasdfuiopghjkl joins |
| 08:13:21 | <drexler> | arkiver, I'd probably be willing to take them, though I concur that the scrape itself would be a nightmare. |
| 08:57:06 | | RisenRubix quits [Ping timeout: 268 seconds] |
| 09:02:39 | | march_happy quits [Ping timeout: 268 seconds] |
| 09:03:57 | | march_happy (march_happy) joins |
| 09:06:32 | | RisenRubix joins |
| 09:09:40 | | DopefishJustin quits [Ping timeout: 240 seconds] |
| 09:09:56 | | DopefishJustin joins |
| 09:09:56 | | DopefishJustin is now authenticated as DopefishJustin |
| 09:16:37 | | sonick (sonick) joins |
| 10:31:22 | | RisenRubix_ joins |
| 10:32:07 | | RisenRubix quits [Remote host closed the connection] |
| 11:17:36 | | march_happy quits [Ping timeout: 265 seconds] |
| 11:31:21 | | march_happy (march_happy) joins |
| 11:36:21 | | katocala is now authenticated as katocala |
| 12:26:21 | <@arkiver> | drexler: the scrape itself may be doable |
| 14:02:39 | | h3ndr1k quits [Client Quit] |
| 14:03:01 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 14:03:40 | | h3ndr1k (h3ndr1k) joins |
| 14:13:19 | | borislav joins |
| 14:19:16 | | Arcorann quits [Ping timeout: 240 seconds] |
| 14:27:15 | | tech_exorcist (tech_exorcist) joins |
| 14:40:46 | | MAddario joins |
| 14:43:46 | | ave quits [Read error: Connection reset by peer] |
| 14:44:07 | | ave (ave) joins |
| 14:44:50 | | fishingforpie joins |
| 14:44:51 | | qwertyasdfuiopghjkl joins |
| 14:45:54 | | Megame (Megame) joins |
| 14:46:52 | | fishing quits [Ping timeout: 240 seconds] |
| 14:50:43 | | MAddario quits [Remote host closed the connection] |
| 15:27:00 | | Stilett0 joins |
| 15:28:04 | | Stiletto quits [Ping timeout: 268 seconds] |
| 15:28:28 | | Megame quits [Client Quit] |
| 15:36:19 | | sonick quits [Client Quit] |
| 16:03:25 | | Ram joins |
| 16:12:57 | <Ram> | Hey, I'd like to archive a bunch of sites related to municipal governments in Ontario (a province of Canada). It's several thousand websites, mostly simple, mostly small websites. I can code and write any scripts. Is this the right place for such a project? |
| 16:14:15 | <theblazehen|m> | Ram: Yeah it is. Note that someone may be able to chuck the list into #archivebot as well, no need for custom code |
| 16:16:57 | <Ram> | Can I just start throwing a couple of sites in and see if it works? |
| 16:17:42 | <theblazehen|m> | Unsure if you'd have appropriate permissions etc, but if it works I'm sure it'd be fine |
| 16:18:01 | <theblazehen|m> | You also aware of https://pypi.org/project/savepagenow/ ? |
| 16:18:30 | <@JAA> | I am that someone who could run a list through. And yeah, sounds suitable in principle, although thousands may take a little while. |
| 16:20:38 | | FalconK (FalconK) joins |
| 16:22:17 | <Ram> | Can we start with just https://fortfrances.civicweb.net/? (everything underneath, including all PDFs) |
| 16:23:30 | <Ram> | I played around with wget spider, there should be ~25k pages |
| 16:32:34 | <@JAA> | Sure. I see ASP.NET form navigation madness, but it still seems to work fairly well without JS anyway. |
| 16:34:24 | <h2ibot> | Holdenzanoish edited Blingee (+24, /* Shutdown notice */): https://wiki.archiveteam.org/?diff=49136&oldid=47801 |
| 16:35:25 | <h2ibot> | Petchea edited Tumblr (+311, /* History */): https://wiki.archiveteam.org/?diff=49137&oldid=48759 |
| 16:35:26 | <h2ibot> | Petchea edited Moegirlpedia (-1, online): https://wiki.archiveteam.org/?diff=49138&oldid=49135 |
| 16:36:31 | | march_happy quits [Ping timeout: 268 seconds] |
| 16:36:43 | | OrIdow6 (OrIdow6) joins |
| 16:36:43 | | @ChanServ sets mode: +o OrIdow6 |
| 16:38:31 | | tech_exorcist quits [Read error: Connection reset by peer] |
| 16:39:06 | | tech_exorcist (tech_exorcist) joins |
| 16:42:55 | | upintheairsheep joins |
| 16:43:27 | <upintheairsheep> | Hello, I plan on creating a universal comment scraper as Tubeup has deemed comments "out of scope" |
| 16:44:13 | <upintheairsheep> | The question is, should I make them a WARC with the original data, or a single JSON with complete data but standardized? |
| 16:49:02 | <upintheairsheep> | ??? |
| 16:50:26 | <upintheairsheep> | Supported sites will include YouTube, TikTok, Instagram, SoundCloud, Scratch, Discuss, Facebook Comments, Internet Archive, App Store and Play Store for now. |
| 16:52:02 | <upintheairsheep> | I will reverse engineer more sites as needed. |
| 16:53:52 | <theblazehen|m> | upintheairsheep: Do you have working code for TikTok? Here's what I use: https://gist.github.com/theblazehen/25c18eda95165e65fc5159942fb5e4db |
| 16:54:09 | <upintheairsheep> | Try TikTok-API |
| 16:54:35 | <theblazehen|m> | Have you gotten it working? ~3 months ago it hadn't been updated to work with the newer tiktok api changes |
| 16:55:05 | <upintheairsheep> | There is an issue on yt-dlp |
| 16:55:23 | <upintheairsheep> | It has useful vital information about TikTok comment extraction |
| 16:55:50 | <upintheairsheep> | And it doesn't need any cookie or user agent hackery, any browser works, even logged out |
| 16:55:59 | <theblazehen|m> | Cool. Well, just sharing my code above as it works, in case it's of interest to you. Haven't got time to work on the project myself recently |
| 16:56:07 | <upintheairsheep> | I have high DNS latency at the moment |
| 16:56:11 | <theblazehen|m> | Even after the first page of comments? |
| 16:56:27 | <upintheairsheep> | My wifi is broken right now |
| 16:56:39 | <upintheairsheep> | For all sites, except irc works fine for me |
| 16:59:52 | | upintheairsheep45 joins |
| 17:01:38 | <upintheairsheep45> | tiktok-comments.py does not seem to scrape replies in the code |
| 17:02:13 | | upintheairsheep quits [Ping timeout: 265 seconds] |
| 17:03:08 | <upintheairsheep45> | see https://github.com/yt-dlp/yt-dlp/issues/5037 |
| 17:03:09 | <theblazehen|m> | It does, you need to look at the (from memory) reply_to field or similar to get the parent comment ID |
| 17:03:30 | <upintheairsheep45> | I've got HAR files for other sites |
| 17:03:43 | <upintheairsheep45> | SoundCloud and Scratch specifically |
| 17:03:55 | <theblazehen|m> | 🤷 if new code that came out works, that's great. Just sharing what I had before there were other options |
| 17:04:37 | <upintheairsheep45> | What does it output, all tiktok comments |
| 17:04:50 | <theblazehen|m> | For the provided video_id, yes |
| 17:05:26 | <upintheairsheep45> | or url list of jsons |
| 17:06:52 | <theblazehen|m> | It outputs the raw comments |
| 17:08:34 | | theblazehen|m posted a file: (1926KiB) < https://matrix.hackint.org/_matrix/media/r0/download/matrix.org/XsqeAIgtPoQjvYEKazfMDNUA/comments.json > |
| 17:08:36 | <theblazehen|m> | From a random video |
| 17:12:51 | | upintheairsheep45 quits [Ping timeout: 265 seconds] |
| 17:13:31 | <h2ibot> | JustAnotherArchivist edited YouTube/Technical details (+440, More playlist types and some names discovered…): https://wiki.archiveteam.org/?diff=49139&oldid=48455 |
| 17:25:18 | | Ram quits [Remote host closed the connection] |
| 17:40:14 | | upintheairsheep joins |
| 17:42:40 | | upintheairsheep quits [Remote host closed the connection] |
| 17:45:59 | | borislav quits [Remote host closed the connection] |
| 18:02:29 | <drexler> | arkiver, Then I'm all ears. :) |
| 19:02:40 | | Sluggs quits [Ping timeout: 268 seconds] |
| 19:03:36 | | Sluggs joins |
| 19:14:10 | | Sluggs quits [Ping timeout: 265 seconds] |
| 19:14:42 | | borislav joins |
| 19:17:13 | | Sluggs joins |
| 19:23:38 | | Sluggs quits [Ping timeout: 268 seconds] |
| 19:24:08 | | Sluggs joins |
| 19:37:29 | | jacksonchen666 (jacksonchen666) joins |
| 19:58:54 | | jacksonchen666 quits [Client Quit] |
| 21:06:35 | | borislav quits [Remote host closed the connection] |
| 21:14:38 | | qwertyasdfuiopghjkl quits [Client Quit] |
| 21:44:43 | | qwertyasdfuiopghjkl joins |
| 21:50:15 | | march_happy (march_happy) joins |
| 22:10:33 | | tech_exorcist quits [Client Quit] |
| 22:12:32 | | wickedplayer494 quits [Remote host closed the connection] |
| 22:18:28 | | Gereon6200 quits [Ping timeout: 240 seconds] |
| 22:18:35 | | wickedplayer494 joins |
| 22:18:53 | | wickedplayer494 is now authenticated as wickedplayer494 |
| 22:49:40 | | borislav joins |
| 22:52:08 | | Arcorann (Arcorann) joins |
| 23:55:46 | | Kuro joins |
| 23:56:07 | | sonick (sonick) joins |