00:00:45dm4v quits [Read error: Connection reset by peer]
00:02:10dm4v joins
00:02:13dm4v quits [Changing host]
00:02:13dm4v (dm4v) joins
00:06:02AlsoHP_Archivist joins
00:23:31<@JAA>pabs: FYI, I'm going to let the ArchiveBot jobs for GNOME Bugzilla finish. It might be worth contacting them about the issue history (which is just gone) and XML export (for programmatic access, returns the normal view now) and possibly attachment description page (returns the attachment instead), but I won't have time for that anytime soon.
00:23:50<@JAA>The actual issues and attachments exist, so at least that will be covered.
00:46:00AlsoHP_Archivist quits [Ping timeout: 250 seconds]
00:46:49AlsoHP_Archivist joins
01:02:24dm4v quits [Client Quit]
01:04:36dm4v joins
01:04:38dm4v quits [Changing host]
01:04:38dm4v (dm4v) joins
01:18:10Video joins
01:32:52Arcorann (Arcorann) joins
01:42:46Arcorann quits [Ping timeout: 250 seconds]
02:15:31<pabs>JAA: do you have some example URLs that are broken? if so I could file an issue (also if you have a GNOME GitLab account I could CC you)
02:17:53<@JAA>pabs: Random example from a bug that was fully covered: https://bugzilla.gnome.org/show_bug.cgi?id=36951 → history https://web.archive.org/web/20210712102539/https://bugzilla.gnome.org/show_activity.cgi?id=36951 and XML version https://web.archive.org/web/20210712102539/https://bugzilla.gnome.org/show_bug.cgi?ctype=xml&id=36951
02:20:37<@JAA>The attachment description page would be e.g. https://bugzilla.gnome.org/attachment.cgi?id=94167&action=edit . This URL was captured but isn't in the WBM yet because the WARC is still sitting on the ArchiveBot pipeline.
02:21:25<@JAA>I don't have a GNOME GitLab account.
02:25:12Atom joins
02:28:55<pabs>ok, I'll take a look later
02:29:14<@JAA>Cheers
02:29:34Barto quits [Ping timeout: 250 seconds]
02:29:46Arcorann (Arcorann) joins
02:32:10AntiLiberal joins
02:39:58Arcorann quits [Ping timeout: 250 seconds]
03:07:56Barto (Barto) joins
03:27:38Video quits [Ping timeout: 250 seconds]
03:34:28BlueMaxima joins
03:45:47qw3rty__ joins
03:46:45Video joins
03:49:18qw3rty_ quits [Ping timeout: 250 seconds]
04:42:08nico_32 quits [Remote host closed the connection]
04:45:51nico_32 (nico) joins
04:54:53jacobk quits [Client Quit]
04:57:26jacobk joins
05:13:05jacobk quits [Ping timeout: 244 seconds]
05:13:06jacobk_ joins
05:28:58AntiLiberal quits [Ping timeout: 252 seconds]
06:38:18fuzzy8021 quits [Ping timeout: 250 seconds]
07:02:58<h2ibot>Tech234a edited YouTube (+80, /* Older unlisted videos (July 2021) */ Add…): https://wiki.archiveteam.org/?diff=47020&oldid=47013
07:17:00<h2ibot>Tech234a edited YouTube (+819, /* Older unlisted videos (July 2021) */ Add…): https://wiki.archiveteam.org/?diff=47021&oldid=47020
07:25:09fuzzy8021 (fuzzy8021) joins
07:29:26AlsoHP_Archivist quits [Ping timeout: 250 seconds]
08:11:16graf_ joins
08:13:12grafck quits [Ping timeout: 250 seconds]
08:18:31BlueMaxima quits [Read error: Connection reset by peer]
08:33:39Sanqui_ is now known as Sanqui
08:59:55<@OrIdow6>Is there a channel for Google Drive?
09:51:32Arcorann (Arcorann) joins
10:46:14Megame (Megame) joins
10:58:44pabs quits [Ping timeout: 244 seconds]
11:04:02pabs (pabs) joins
11:38:33Arcorann quits [Remote host closed the connection]
11:58:32Wayward- quits [Ping timeout: 250 seconds]
12:08:16mutantmonkey quits [Remote host closed the connection]
12:08:31mutantmonkey (mutantmonkey) joins
12:44:03<rewby>thuban: Your regex doesn't produce any results. (And I've scanned the whole dataset)
12:47:02<thuban>fuck, i left an asterisk out. '(file|image):\s*"([^"]*)",'
12:49:55Arcorann joins
13:00:11<rewby>thuban: Yeah. I figured that. Do you care about having file and image separate, or do you just want one big list?
13:03:13<thuban>separate, if that wouldn't require any effort on your part; otherwise together
13:03:42<thuban>(i _think_ we already got all the thumbnails in the regular ab run, but i need to check)
13:04:57<rewby>Cool. That's easy. My system doesn't really do capture groups so I have to do a second pass to get the urls out of the 'image: "<url>"' strings
13:05:19<rewby>It gives me a big list of regex matches per warc
13:05:28<rewby>And then I post-process from there
13:07:07<rewby>I also only process text/<whatever> and application/json entries. I don't match on image or video files, for obvious reasons
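The two-pass approach rewby describes (collect whole regex matches first, then pull the URLs out of them, and only look at text/* and application/json records) could be sketched roughly as follows. This is not rewby's extractor; the function names `is_text_like` and `extract_urls` are hypothetical, and the record body and Content-Type are assumed to already be available as strings.

```python
import re

# Pass 1: whole matches only, mimicking a matcher without capture groups.
MATCH_RE = re.compile(r'(?:file|image):\s*"[^"]*",')
# Pass 2: thuban's corrected regex, used to split each match into key and URL.
SPLIT_RE = re.compile(r'(file|image):\s*"([^"]*)",')


def is_text_like(content_type):
    """Return True for text/* and application/json media types."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/") or media_type == "application/json"


def extract_urls(body, content_type):
    """Return {'file': [...], 'image': [...]} for one decoded response body."""
    urls = {"file": [], "image": []}
    if not is_text_like(content_type):
        # Skip images, video and other binary payloads entirely.
        return urls
    for raw in MATCH_RE.findall(body):
        key, url = SPLIT_RE.fullmatch(raw).groups()
        urls[key].append(url)
    return urls
```

Keeping "file" and "image" in separate lists matches what thuban asked for; merging them later is a one-line concatenation.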
13:09:57<rewby>thuban: I've updated the regexes and am re-running. It looks to be obtaining urls.
13:10:54<thuban>i actually tested it this time, haha
13:11:36<rewby>Cool
13:11:45<rewby>I'll get you a sample of one file just to check it by you
13:26:21<rewby>thuban: Here's a sample from one of the warc files: https://transfer.archivete.am/w7DSd/file.txt https://transfer.archivete.am/orEBX/image.txt
13:26:31<rewby>This look good to you?
13:27:11<thuban>yep!
13:27:37<rewby>Cool. Still processing the rest. But I'm doing this singlethreaded because I'm lazy.
13:28:06<rewby>It's got maybe 10 minutes left
13:29:10qwertyasdfuiopghjkl joins
13:32:31<rewby>thuban: It's a mildly hacked up extractor, but it's doing the job. https://s3.services.ams.aperture-laboratories.science/rewby/public/2a4b8143-8fbb-406f-8880-503b8032405f/1627911131.0666816.png
13:34:17<thuban>nice
13:37:54<rewby>thuban: All done! https://transfer.archivete.am/15RnqL/file.txt https://transfer.archivete.am/CHsW5/image.txt
13:55:13AntiLiberal joins
14:18:11Guest69 joins
14:29:35<thuban>rewby: for some reason i'm getting only 86 unique urls from either of those files when there should be many more.
14:29:45<thuban>for example: https://app4.rthk.hk/special/rthkmemory/details/hk-footprints/108 is in the warcs, and running the (corrected) regex on that page yields 'file: "https://app4.rthk.hk/podcast/media/rthkmemory/b_v08.mp4",' and 'image: "https://rthkmemorycms.rthk.hk/photo/media/thumbnail/108",'
14:29:53<thuban>but 'https://app4.rthk.hk/podcast/media/rthkmemory/b_v08.mp4' is not in file.txt and 'https://rthkmemorycms.rthk.hk/photo/media/thumbnail/108' is not in image.txt.
14:30:05<rewby>Uh. Lemme check
14:31:50<thuban>idk what your plumbing looks like, but is it possible you ran one warc repeatedly instead of all the warcs? (24 warcs, 24 copies of each url i _do_ have)
14:32:13<thuban>(oh wait nvm, 25 warcs)
14:33:02<rewby>thuban: d'oh. I ran all the warcs, but I didn't concat the results properly
14:33:11<rewby>Lemme fix that
14:33:46<thuban>gotcha, thanks
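A minimal sketch of the concatenation step that went wrong here, assuming each WARC yields its own list of URLs. The name `merge_unique` is hypothetical and this is not rewby's actual plumbing; it only illustrates merging every per-WARC list while dropping repeats.

```python
def merge_unique(per_warc_results):
    """Concatenate per-WARC URL lists, keeping the first occurrence of each URL."""
    seen = set()
    merged = []
    for urls in per_warc_results:
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

Comparing the merged count against the per-WARC counts is a cheap sanity check: if the merged list is no longer than the list from a single WARC, the same results were probably written out repeatedly.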
14:33:58<rewby>thuban: How's this? https://transfer.archivete.am/JLRYd/file.txt https://transfer.archivete.am/sInqE/image.txt
14:34:52<rewby>Hm. Still not quite right I think
14:36:56<rewby>It's better but still not quite there
14:39:00<thuban>yeah... i do expect there to be a few copies of each result (each detail page has a base url and then two possible language parameters) but that's not what it looks like is happening
14:40:14<rewby>I'm double checking a few things.
14:40:27<rewby>Hmmm.
14:40:34<rewby>I wonder if we're dealing with an encoding problem
14:41:19<rewby>thuban: I'm doing another run with some tweaks that might help.
14:41:49<rewby>If you still find missing things, I'll have to go and manually dig into the warcs to see what's wrong because that'll be a bug with my warc reader
14:43:33<thuban>i think to confirm anything missing i would have to manually download and zgrep the warcs--that other one was just a lucky spot-check
14:43:47<rewby>Fair enough
14:43:56<rewby>Just zgrepping doesn't always work
14:44:10<thuban>oh?
14:44:26<rewby>The problem is that warcs contain raw http responses. Which means your content can be encoded a number of ways. It's not uncommon to have a gzipped response or brotli compressed response
14:44:34<thuban>ah, yeah
14:45:09<rewby>There's a lot of screwery going on in this software to try and deal with this
14:46:32<thuban>i knew there was a reason i asked you instead of trying to do it myself ;)
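To illustrate why a plain zgrep can miss matches: the WARC's own gzip layer is removed by zgrep, but a response body that was itself sent with Content-Encoding: gzip or br is still compressed underneath. A rough sketch of decoding a body before searching it, assuming the raw body bytes and the Content-Encoding header value have already been pulled out of the record. The function name `decode_body` is hypothetical, and the `brotli` module is a third-party package that is only imported when needed.

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Undo a single HTTP Content-Encoding layer on a response body."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        return raw
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Per the HTTP spec this is zlib-wrapped deflate...
            return zlib.decompress(raw)
        except zlib.error:
            # ...but some servers send a raw deflate stream instead.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding == "br":
        import brotli  # third-party package, assumed to be installed

        return brotli.decompress(raw)
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")
```

This handles only one encoding layer and ignores chunked transfer encoding, so it is a starting point rather than a complete WARC reader.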
14:57:55graf_ quits [Read error: Connection reset by peer]
15:03:37Arcorann quits [Client Quit]
15:14:11<rewby>thuban: Here's another attempt. I turned off all the "smart"ness. It should've gotten everything unless there was a decoding issue. https://transfer.archivete.am/ILUoM/file_unique.txt https://transfer.archivete.am/xBjlu/image_unique.txt
15:26:49AntiLiberal quits [Ping timeout: 252 seconds]
15:28:54<thuban>yeah, that's more consistent with what i was expecting
15:35:38Video quits [Ping timeout: 250 seconds]
15:45:08<thuban>huh... so it looks like archivebot successfully got everything (except a couple of m3u8s) in the original run. i wonder why playback doesn't work in the wbm?
15:45:52<rewby>Are there any POST requests involved?
15:46:14<rewby>Or maybe javascript that's unhappy?
15:46:52<thuban>lol, the only requests that fail are jwplayer's jwpsrv.js and sharing.js, which have somehow been double-rewritten: e.g. https://web.archive.org/web/20210728093807/https://web.archive.org/web/20210728093807/https://ssl.p.jwpcdn.com/6/8/jwpsrv.js .
15:47:29<thuban>(single-rewritten does exist in the archive and presumably would work.)
15:47:59<rewby>Huh. Interesting quirk
15:54:16driib quits [Client Quit]
15:54:35driib (driib) joins
16:02:07<thuban>https://web.archive.org/web/20210728093807js_/https://app4.rthk.hk/special/rthkmemory/assets/js/jwplayer/jwplayer.js
16:05:20<thuban>in the 'c.repo' function (which returns the base path jwplayer uses to get some assets) the url is rewritten once when the string literal with the original cdn's url is used, then again when the generated url string is munged for ssl
16:07:21<thuban>i guess there's no principled way to avoid this...
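For reference, the double-rewritten URLs above can be collapsed back to a single rewrite after the fact. The sketch below is purely illustrative and is not something the Wayback Machine or jwplayer does; the name `collapse_rewrites` is hypothetical, and the prefix pattern (a timestamp of up to 14 digits with an optional two-letter modifier such as `js_`) is an assumption based on the URLs quoted in this log.

```python
import re

# Assumed shape of a Wayback Machine replay prefix, e.g.
# https://web.archive.org/web/20210728093807js_/
WAYBACK_PREFIX = re.compile(
    r"https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/"
)


def collapse_rewrites(url):
    """Keep the outermost Wayback prefix and strip any nested ones."""
    outer = WAYBACK_PREFIX.match(url)
    if outer is None:
        return url  # not a rewritten URL, leave it alone
    rest = url[outer.end():]
    while True:
        inner = WAYBACK_PREFIX.match(rest)
        if inner is None:
            break
        rest = rest[inner.end():]
    return outer.group(0) + rest
```

This only repairs a URL once it has been generated; it does not address the underlying problem thuban describes, where the string is rewritten once as a literal and again after the script munges it.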
16:25:44LeGoupil joins
16:31:59jacobk_ quits [Ping timeout: 244 seconds]
17:00:20Guest69 quits [Client Quit]
17:23:43Video joins
17:25:01Ruthalas quits [Client Quit]
17:25:21Ruthalas (Ruthalas) joins
17:30:22qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
17:34:04h3ndr1k quits [Client Quit]
17:42:50h3ndr1k (h3ndr1k) joins
17:51:43Iki joins
17:53:54Wingy7 (Wingy) joins
17:54:44Wingy quits [Ping timeout: 250 seconds]
17:54:44Wingy7 is now known as Wingy
17:55:00AlsoHP_Archivist joins
18:42:40Mateon2 joins
18:44:34Mateon1 quits [Ping timeout: 250 seconds]
18:44:34Mateon2 is now known as Mateon1
18:52:53CottonProphecy joins
19:27:25qwertyasdfuiopghjkl joins
19:56:35CottonProphecy quits [Ping timeout: 244 seconds]
19:57:03GunDigger joins
20:01:11jacobk joins
20:10:20LeGoupil quits [Client Quit]
20:13:45CottonProphecy joins
20:17:36G4te_Keep3r joins
20:30:58Video quits [Ping timeout: 252 seconds]
20:38:21<wizards>has anyone archived drivers, manuals, sdks and the like from canon's website? just figured i should ask before trying to archive it myself
20:38:38<AK>Got a link to the site and we can check?
20:41:27<wizards>example page for a specific camera: https://www.usa.canon.com/internet/portal/us/home/support/details/cameras/point-and-shoot-digital-cameras/slim-stylish-cameras/powershot-a2500/powershot-a2500?tab=drivers_downloads
20:41:36<wizards>and the place where i got that link from: https://www.usa.canon.com/internet/portal/us/home/support/drivers-downloads
20:43:08<AK>Hmm, I could give it a go in AB and see how it goes
20:44:00<wizards>might work for things like the reference photos, but the section that lists downloads uses js and probably would need manual work to scrape
20:44:08<wizards>i was writing a lua script to do exactly that
20:45:21<AK>Urgh, same for the manuals, it's all js
20:45:42<AK>It's all running through AB now anyway so we at least grab what we can
20:48:23<thuban>if you write your script to get the urls for the downloads, we can run that list through archivebot, too, so that at least the files will be in the wayback machine
20:50:48<AK>^Forgot about that
20:53:56CottonProphecy quits [Ping timeout: 244 seconds]
20:55:05<wizards>in my list of urls, should i include the original urls or the ones they redirect to? since all of them are redirects
20:55:10<wizards>e.g. https://pdisp01.c-wss.com/gdl/WWUFORedirectTarget.do?id=MDMwMDAxMDYyODAx&cmp=ABR&lang=EN
20:55:45<AK>Original means we'll archive the redirect too
20:55:46<thuban>archivebot can follow redirects, so it's probably best to use the originals (since that way both will point to the file)
20:55:57<thuban>^ what he said
21:00:44<@JAA>Uh
21:01:03<@JAA>Generally yes, but it depends.
21:01:47<@JAA>If all of the downloads behave like the above, i.e. the actual downloads are on a different host, it's fine.
21:26:38jacobk quits [Ping timeout: 250 seconds]
22:25:42<h2ibot>OrIdow6 edited Framasoft (+79, Correction on discovery source): https://wiki.archiveteam.org/?diff=47022&oldid=47014
22:35:39AlsoHP_Archivist quits [Client Quit]
22:38:28HP_Archivist (HP_Archivist) joins
23:05:34Ruthalas8 (Ruthalas) joins
23:05:57Ruthalas quits [Read error: Connection reset by peer]
23:05:57Ruthalas8 is now known as Ruthalas
23:51:36Arcorann joins
23:51:43BlueMaxima joins