00:01:21ramparts quits [Remote host closed the connection]
00:03:04ramparts joins
00:06:38<@JAA>jrwr: There's some weirdness happening with HTTP-to-HTTPS redirects on the wiki domain. From #archivebot: 00:04:12 <+pokechu22> Oh, that's interesting: http://archiveteam.org/images/e/e6/Archiveteam.jpg -> https://archiveteam.org/images/e/e6/Archiveteam.jpg -> http://wiki.archiveteam.org/images/e/e6/Archiveteam.jpg -> https://wiki.archiveteam.org/e/e6/Archiveteam.jpg
00:06:46<@JAA>The last one is missing 'images'.
01:01:50jacobk quits [Ping timeout: 265 seconds]
01:25:33Justin[home] joins
01:27:33DopefishJustin quits [Ping timeout: 268 seconds]
01:41:22BlueMaxima quits [Client Quit]
01:42:54pabs quits [Quit: Don't rest until all the world is paved in moss and greenery.]
01:44:57pabs (pabs) joins
02:09:17tzt quits [Remote host closed the connection]
02:09:41tzt (tzt) joins
02:40:10<@jrwr>Interesting, I'll look into it
02:43:49qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
03:04:28Atom quits [Ping timeout: 240 seconds]
03:05:15Atom joins
03:18:13<monika>tiktok is a nightmare to scrape at scale
03:21:12ramparts quits [Remote host closed the connection]
03:21:53michaelblob (michaelblob) joins
03:24:59Justin[home] quits [Remote host closed the connection]
03:24:59katocala quits [Remote host closed the connection]
03:25:08DopefishJustin joins
03:25:09katocala joins
03:25:16michaelblob_ quits [Ping timeout: 240 seconds]
03:59:05<@arkiver>please don't just start dumping tons of items of tiktok videos on IA
04:01:38march_happy quits [Ping timeout: 265 seconds]
04:01:53march_happy (march_happy) joins
04:54:17sonick quits [Client Quit]
05:03:50Jake8 (Jake) joins
05:05:51Jake quits [Ping timeout: 255 seconds]
05:05:51Jake8 is now known as Jake
05:08:35eroc19908 (eroc1990) joins
05:10:10eroc1990 quits [Ping timeout: 268 seconds]
05:34:17superkuh_ joins
05:34:27atphoenix_ (atphoenix) joins
05:36:28superkuh quits [Ping timeout: 240 seconds]
05:36:52atphoenix quits [Ping timeout: 240 seconds]
05:49:25Overlordz quits [Quit: Leaving]
06:16:52RisenRubix__ quits [Ping timeout: 240 seconds]
06:23:11michaelblob_ (michaelblob) joins
06:27:15michaelblob quits [Ping timeout: 268 seconds]
06:41:02RisenRubix joins
06:51:07superkuh_ quits [Remote host closed the connection]
06:51:08katocala quits [Remote host closed the connection]
06:51:09superkuh_ joins
06:51:34katocala joins
06:54:08superkuh__ joins
06:55:02superkuh_ quits [Remote host closed the connection]
07:18:39qwertyasdfuiopghjkl joins
08:13:21<drexler>arkiver, I'd probably be willing to take them, though I concur that the scrape itself would be a nightmare.
08:57:06RisenRubix quits [Ping timeout: 268 seconds]
09:02:39march_happy quits [Ping timeout: 268 seconds]
09:03:57march_happy (march_happy) joins
09:06:32RisenRubix joins
09:09:40DopefishJustin quits [Ping timeout: 240 seconds]
09:09:56DopefishJustin joins
09:16:37sonick (sonick) joins
10:31:22RisenRubix_ joins
10:32:07RisenRubix quits [Remote host closed the connection]
11:17:36march_happy quits [Ping timeout: 265 seconds]
11:31:21march_happy (march_happy) joins
12:26:21<@arkiver>drexler: the scrape itself may be doable
14:02:39h3ndr1k quits [Client Quit]
14:03:01qwertyasdfuiopghjkl quits [Client Quit]
14:03:40h3ndr1k (h3ndr1k) joins
14:13:19borislav joins
14:19:16Arcorann quits [Ping timeout: 240 seconds]
14:27:15tech_exorcist (tech_exorcist) joins
14:40:46MAddario joins
14:43:46ave quits [Read error: Connection reset by peer]
14:44:07ave (ave) joins
14:44:50fishingforpie joins
14:44:51qwertyasdfuiopghjkl joins
14:45:54Megame (Megame) joins
14:46:52fishing quits [Ping timeout: 240 seconds]
14:50:43MAddario quits [Remote host closed the connection]
15:27:00Stilett0 joins
15:28:04Stiletto quits [Ping timeout: 268 seconds]
15:28:28Megame quits [Client Quit]
15:36:19sonick quits [Client Quit]
16:03:25Ram joins
16:12:57<Ram>Hey, I'd like to archive a bunch of sites related to municipal governments in Ontario (a province of Canada). It's several thousand websites, mostly simple, mostly small websites. I can code and write any scripts. Is this the right place for such a project?
16:14:15<theblazehen|m>Ram: Yeah it is. Note that someone may be able to chuck the list into #archivebot as well, no need for custom code
16:16:57<Ram>Can I just start throwing a couple of sites in and see if it works?
16:17:42<theblazehen|m>Unsure if you'd have appropriate permissions etc, but if it works I'm sure it'd be fine
16:18:01<theblazehen|m>Are you also aware of https://pypi.org/project/savepagenow/ ?
16:18:30<@JAA>I am that someone who could run a list through. And yeah, sounds suitable in principle, although thousands may take a little while.
16:20:38FalconK (FalconK) joins
16:22:17<Ram>Can we start with just https://fortfrances.civicweb.net/? (everything underneath, including all PDFs)
16:23:30<Ram>I played around with wget spider, there should be ~25k pages
16:32:34<@JAA>Sure. I see ASP.NET form navigation madness, but it still seems to work fairly well without JS anyway.
16:34:24<h2ibot>Holdenzanoish edited Blingee (+24, /* Shutdown notice */): https://wiki.archiveteam.org/?diff=49136&oldid=47801
16:35:25<h2ibot>Petchea edited Tumblr (+311, /* History */): https://wiki.archiveteam.org/?diff=49137&oldid=48759
16:35:26<h2ibot>Petchea edited Moegirlpedia (-1, online): https://wiki.archiveteam.org/?diff=49138&oldid=49135
16:36:31march_happy quits [Ping timeout: 268 seconds]
16:36:43OrIdow6 (OrIdow6) joins
16:36:43@ChanServ sets mode: +o OrIdow6
16:38:31tech_exorcist quits [Read error: Connection reset by peer]
16:39:06tech_exorcist (tech_exorcist) joins
16:42:55upintheairsheep joins
16:43:27<upintheairsheep>Hello, I plan on creating a universal comment scraper as Tubeup has deemed comments "out of scope"
16:44:13<upintheairsheep>The question is, should I make them a WARC with the original data, or a single JSON with complete data but standardized?
16:49:02<upintheairsheep>???
16:50:26<upintheairsheep>Supported sites will include YouTube, TikTok, Instagram, SoundCloud, Scratch, Disqus, Facebook Comments, Internet Archive, App Store and Play Store for now.
16:52:02<upintheairsheep>I will reverse engineer more sites as needed.
16:53:52<theblazehen|m>upintheairsheep: Do you have working code for TikTok? Here's what I use: https://gist.github.com/theblazehen/25c18eda95165e65fc5159942fb5e4db
16:54:09<upintheairsheep>Try TikTok-API
16:54:35<theblazehen|m>Have you gotten it working? ~3 months ago it hadn't been updated to work with the newer tiktok api changes
16:55:05<upintheairsheep>There is an issue on yt-dlp
16:55:23<upintheairsheep>It has vital information about TikTok comment extraction
16:55:50<upintheairsheep>And it doesn't need any cookie or user agent hackery, any browser works, even logged out
16:55:59<theblazehen|m>Cool. Well, just sharing my code above as it works, in case it's of interest to you. Haven't got time to work on the project myself recently
16:56:07<upintheairsheep>I have high DNS latency at the moment
16:56:11<theblazehen|m>Even after the first page of comments?
16:56:27<upintheairsheep>My wifi is broken right now
16:56:39<upintheairsheep>For all sites, except IRC, which works fine for me
16:59:52upintheairsheep45 joins
17:01:38<upintheairsheep45>tiktok-comments.py does not seem to scrape replies in the code
17:02:13upintheairsheep quits [Ping timeout: 265 seconds]
17:03:08<upintheairsheep45>see https://github.com/yt-dlp/yt-dlp/issues/5037
17:03:09<theblazehen|m>It does, you need to look at the (from memory) reply_to field or similar to get the parent comment ID
17:03:30<upintheairsheep45>I've got HAR files for other sites
17:03:43<upintheairsheep45>SoundCloud and Scratch specifically
17:03:55<theblazehen|m>🤷if new code that came out works, that's great. Just sharing what I had before there were other options
17:04:37<upintheairsheep45>What does it output, all TikTok comments?
17:04:50<theblazehen|m>For the provided video_id, yes
17:05:26<upintheairsheep45>or a URL list of JSONs?
17:06:52<theblazehen|m>It outputs the raw comments
17:08:34theblazehen|m posted a file: (1926KiB) < https://matrix.hackint.org/_matrix/media/r0/download/matrix.org/XsqeAIgtPoQjvYEKazfMDNUA/comments.json >
17:08:36<theblazehen|m>From a random video
17:12:51upintheairsheep45 quits [Ping timeout: 265 seconds]
17:13:31<h2ibot>JustAnotherArchivist edited YouTube/Technical details (+440, More playlist types and some names discovered…): https://wiki.archiveteam.org/?diff=49139&oldid=48455
17:25:18Ram quits [Remote host closed the connection]
17:40:14upintheairsheep joins
17:42:40upintheairsheep quits [Remote host closed the connection]
17:45:59borislav quits [Remote host closed the connection]
18:02:29<drexler>arkiver, Then I'm all ears. :)
19:02:40Sluggs quits [Ping timeout: 268 seconds]
19:03:36Sluggs joins
19:14:10Sluggs quits [Ping timeout: 265 seconds]
19:14:42borislav joins
19:17:13Sluggs joins
19:23:38Sluggs quits [Ping timeout: 268 seconds]
19:24:08Sluggs joins
19:37:29jacksonchen666 (jacksonchen666) joins
19:58:54jacksonchen666 quits [Client Quit]
21:06:35borislav quits [Remote host closed the connection]
21:14:38qwertyasdfuiopghjkl quits [Client Quit]
21:44:43qwertyasdfuiopghjkl joins
21:50:15march_happy (march_happy) joins
22:10:33tech_exorcist quits [Client Quit]
22:12:32wickedplayer494 quits [Remote host closed the connection]
22:18:28Gereon6200 quits [Ping timeout: 240 seconds]
22:18:35wickedplayer494 joins
22:49:40borislav joins
22:52:08Arcorann (Arcorann) joins
23:55:46Kuro joins
23:56:07sonick (sonick) joins