| 00:00:45 | | dm4v quits [Read error: Connection reset by peer] |
| 00:02:10 | | dm4v joins |
| 00:02:13 | | dm4v is now authenticated as dm4v |
| 00:02:13 | | dm4v quits [Changing host] |
| 00:02:13 | | dm4v (dm4v) joins |
| 00:06:02 | | AlsoHP_Archivist joins |
| 00:23:31 | <@JAA> | pabs: FYI, I'm going to let the ArchiveBot jobs for GNOME Bugzilla finish. It might be worth contacting them about the issue history (which is just gone) and XML export (for programmatic access, returns the normal view now) and possibly attachment description page (returns the attachment instead), but I won't have time for that anytime soon. |
| 00:23:50 | <@JAA> | The actual issues and attachments exist, so at least that will be covered. |
| 00:46:00 | | AlsoHP_Archivist quits [Ping timeout: 250 seconds] |
| 00:46:49 | | AlsoHP_Archivist joins |
| 01:02:24 | | dm4v quits [Client Quit] |
| 01:04:36 | | dm4v joins |
| 01:04:38 | | dm4v is now authenticated as dm4v |
| 01:04:38 | | dm4v quits [Changing host] |
| 01:04:38 | | dm4v (dm4v) joins |
| 01:18:10 | | Video joins |
| 01:32:52 | | Arcorann (Arcorann) joins |
| 01:42:46 | | Arcorann quits [Ping timeout: 250 seconds] |
| 02:15:31 | <pabs> | JAA: do you have some example URLs that are broken? if so I could file an issue (also if you have a GNOME GitLab account I could CC you) |
| 02:17:53 | <@JAA> | pabs: Random example from a bug that was fully covered: https://bugzilla.gnome.org/show_bug.cgi?id=36951 → history https://web.archive.org/web/20210712102539/https://bugzilla.gnome.org/show_activity.cgi?id=36951 and XML version https://web.archive.org/web/20210712102539/https://bugzilla.gnome.org/show_bug.cgi?ctype=xml&id=36951 |
| 02:20:37 | <@JAA> | The attachment description page would be e.g. https://bugzilla.gnome.org/attachment.cgi?id=94167&action=edit . This URL was captured but isn't in the WBM yet because the WARC is still sitting on the ArchiveBot pipeline. |
| 02:21:25 | <@JAA> | I don't have a GNOME GitLab account. |
| 02:25:12 | | Atom joins |
| 02:28:55 | <pabs> | ok, I'll take a look later |
| 02:29:14 | <@JAA> | Cheers |
| 02:29:34 | | Barto quits [Ping timeout: 250 seconds] |
| 02:29:46 | | Arcorann (Arcorann) joins |
| 02:32:10 | | AntiLiberal joins |
| 02:39:58 | | Arcorann quits [Ping timeout: 250 seconds] |
| 03:07:56 | | Barto (Barto) joins |
| 03:27:38 | | Video quits [Ping timeout: 250 seconds] |
| 03:34:28 | | BlueMaxima joins |
| 03:45:47 | | qw3rty__ joins |
| 03:46:45 | | Video joins |
| 03:49:18 | | qw3rty_ quits [Ping timeout: 250 seconds] |
| 04:42:08 | | nico_32 quits [Remote host closed the connection] |
| 04:45:51 | | nico_32 (nico) joins |
| 04:54:53 | | jacobk quits [Client Quit] |
| 04:57:26 | | jacobk joins |
| 05:13:05 | | jacobk quits [Ping timeout: 244 seconds] |
| 05:13:06 | | jacobk_ joins |
| 05:28:58 | | AntiLiberal quits [Ping timeout: 252 seconds] |
| 06:38:18 | | fuzzy8021 quits [Ping timeout: 250 seconds] |
| 07:02:58 | <h2ibot> | Tech234a edited YouTube (+80, /* Older unlisted videos (July 2021) */ Add…): https://wiki.archiveteam.org/?diff=47020&oldid=47013 |
| 07:17:00 | <h2ibot> | Tech234a edited YouTube (+819, /* Older unlisted videos (July 2021) */ Add…): https://wiki.archiveteam.org/?diff=47021&oldid=47020 |
| 07:25:09 | | fuzzy8021 (fuzzy8021) joins |
| 07:29:26 | | AlsoHP_Archivist quits [Ping timeout: 250 seconds] |
| 08:11:16 | | graf_ joins |
| 08:13:12 | | grafck quits [Ping timeout: 250 seconds] |
| 08:18:31 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 08:33:39 | | Sanqui_ is now known as Sanqui |
| 08:33:39 | | Sanqui is now authenticated as Sanqui |
| 08:59:55 | <@OrIdow6> | Is there a channel for Google Drive? |
| 09:51:32 | | Arcorann (Arcorann) joins |
| 10:46:14 | | Megame (Megame) joins |
| 10:58:44 | | pabs quits [Ping timeout: 244 seconds] |
| 11:04:02 | | pabs (pabs) joins |
| 11:38:33 | | Arcorann quits [Remote host closed the connection] |
| 11:58:32 | | Wayward- quits [Ping timeout: 250 seconds] |
| 12:08:16 | | mutantmonkey quits [Remote host closed the connection] |
| 12:08:31 | | mutantmonkey (mutantmonkey) joins |
| 12:44:03 | <rewby> | thuban: Your regex doesn't produce any results. (And I've scanned the whole dataset) |
| 12:47:02 | <thuban> | fuck, i left an asterisk out. '(file|image):\s*"([^"]*)",' |
| 12:49:55 | | Arcorann joins |
| 12:49:55 | | Arcorann is now authenticated as Arcorann |
| 13:00:11 | <rewby> | thuban: Yeah. I figured that. Do you care about having file and image separate, or do you just want one big list? |
| 13:03:13 | <thuban> | separate, if that wouldn't require any effort on your part; otherwise together |
| 13:03:42 | <thuban> | (i _think_ we already got all the thumbnails in the regular ab run, but i need to check) |
| 13:04:57 | <rewby> | Cool. That's easy. My system doesn't really do capture groups so I have to do a second pass to get the urls out of the 'image: "<url>"' strings |
| 13:05:19 | <rewby> | It gives me a big list of regex matches per warc |
| 13:05:28 | <rewby> | And then I post-process from there |
| 13:07:07 | <rewby> | I also only process text/<whatever> and application/json entries. I don't match on image or video files, for obvious reaons |
| 13:07:09 | <rewby> | *reasons |
| 13:09:57 | <rewby> | thuban: I've updated the regexes and am re-running. It looks to be obtaining urls. |
| 13:10:54 | <thuban> | i actually tested it this time, haha |
| 13:11:36 | <rewby> | Cool |
| 13:11:45 | <rewby> | I'll get you a sample of one file just to check it by you |
| 13:26:21 | <rewby> | thuban: Here's a sample from one of the warc files: https://transfer.archivete.am/w7DSd/file.txt https://transfer.archivete.am/orEBX/image.txt |
| 13:26:31 | <rewby> | This look good to you? |
| 13:27:11 | <thuban> | yep! |
| 13:27:37 | <rewby> | Cool. Still processing the rest. But I'm doing this singlethreaded because I'm lazy. |
| 13:28:06 | <rewby> | It's got maybe 10 minutes left |
| 13:29:10 | | qwertyasdfuiopghjkl joins |
| 13:32:31 | <rewby> | thuban: It's a mildly hacked up extractor, but it's doing the job. https://s3.services.ams.aperture-laboratories.science/rewby/public/2a4b8143-8fbb-406f-8880-503b8032405f/1627911131.0666816.png |
| 13:34:17 | <thuban> | nice |
| 13:37:54 | <rewby> | thuban: All done! https://transfer.archivete.am/15RnqL/file.txt https://transfer.archivete.am/CHsW5/image.txt |
| 13:55:13 | | AntiLiberal joins |
| 14:18:11 | | Guest69 joins |
| 14:29:35 | <thuban> | rewby: for some reason i'm getting only 86 unique urls from either of those files when there should be many more. |
| 14:29:45 | <thuban> | for example: https://app4.rthk.hk/special/rthkmemory/details/hk-footprints/108 is in the warcs, and running the (corrected) regex on that page yields 'file: "https://app4.rthk.hk/podcast/media/rthkmemory/b_v08.mp4",' and 'image: "https://rthkmemorycms.rthk.hk/photo/media/thumbnail/108",' |
| 14:29:53 | <thuban> | but 'https://app4.rthk.hk/podcast/media/rthkmemory/b_v08.mp4' is not in file.txt and 'https://rthkmemorycms.rthk.hk/photo/media/thumbnail/108' is not in image.txt. |
| 14:30:05 | <rewby> | Uh. Lemme check |
| 14:31:50 | <thuban> | idk what your plumbing looks like, but is it possible you ran one warc repeatedly instead of all the warcs? (24 warcs, 24 copies of each url i _do_ have) |
| 14:32:13 | <thuban> | (oh wait nvm, 25 warcs) |
| 14:33:02 | <rewby> | thuban: d'oh. I ran all the warcs, but I didn't concat the results properly |
| 14:33:11 | <rewby> | Lemme fix that |
| 14:33:46 | <thuban> | gotcha, thanks |
| 14:33:58 | <rewby> | thuban: How's this? https://transfer.archivete.am/JLRYd/file.txt https://transfer.archivete.am/sInqE/image.txt |
| 14:34:52 | <rewby> | Hm. Still not quite right I think |
| 14:36:56 | <rewby> | It's better but still not quite there |
| 14:39:00 | <thuban> | yeah... i do expect there to be a few copies of each result (each detail page has a base url and then two possible language parameters) but that's not what it looks like is happening |
| 14:40:14 | <rewby> | I'm double checking a few things. |
| 14:40:27 | <rewby> | Hmmm. |
| 14:40:34 | <rewby> | I wonder if we're dealing with an encoding problem |
| 14:41:19 | <rewby> | thuban: I'm doing another run with some tweaks that might help. |
| 14:41:49 | <rewby> | If you still find missing things, I'll have to go and manually dig into the warcs to see what's wrong because that'll be a bug with my warc reader |
| 14:43:33 | <thuban> | i think to confirm anything missing i would have to manually download and zgrep the warcs--that other one was just a lucky spot-check |
| 14:43:47 | <rewby> | Fair enough |
| 14:43:56 | <rewby> | Just zgrepping doesn't always work |
| 14:44:10 | <thuban> | oh? |
| 14:44:26 | <rewby> | The problem is that warcs contain raw http responses. Which means your content can be encoded a number of ways. It's not uncomming to have a gzipped response or brotli compressed response |
| 14:44:34 | <thuban> | ah, yeah |
| 14:44:35 | <rewby> | *uncommon |
| 14:45:09 | <rewby> | There's a lot of screwery going on in this software to try and deal with this |
| 14:46:32 | <thuban> | i knew there was a reason i asked you instead of trying to do it myself ;) |
| 14:57:55 | | graf_ quits [Read error: Connection reset by peer] |
| 15:03:37 | | Arcorann quits [Client Quit] |
| 15:14:11 | <rewby> | thuban: Here's another attempt. I turned off all the "smart"ness. It should've gotten everything unless there was a decoding issue. https://transfer.archivete.am/ILUoM/file_unique.txt https://transfer.archivete.am/xBjlu/image_unique.txt |
| 15:26:49 | | AntiLiberal quits [Ping timeout: 252 seconds] |
| 15:28:54 | <thuban> | yeah, that's more consistent with what i was expecting |
| 15:35:38 | | Video quits [Ping timeout: 250 seconds] |
| 15:45:08 | <thuban> | huh... so it looks like archivebot successfully got everything (except a couple of m3u8s) in the original run. i wonder why playback doesn't work in the wbm? |
| 15:45:52 | <rewby> | Are there any POST requests involved? |
| 15:46:14 | <rewby> | Or maybe javascript that's unhappy? |
| 15:46:52 | <thuban> | lol, the only requests that fail are jwplayer's jwpsrv.js and sharing.js, which have somehow been double-rewritten: e.g. https://web.archive.org/web/20210728093807/https://web.archive.org/web/20210728093807/https://ssl.p.jwpcdn.com/6/8/jwpsrv.js . |
| 15:47:29 | <thuban> | (single-rewritten does exist in the archive and presumably would work.) |
| 15:47:59 | <rewby> | Huh. Interesting quirk |
| 15:54:16 | | driib quits [Client Quit] |
| 15:54:35 | | driib (driib) joins |
| 16:02:07 | <thuban> | https://web.archive.org/web/20210728093807js_/https://app4.rthk.hk/special/rthkmemory/assets/js/jwplayer/jwplayer.js |
| 16:05:20 | <thuban> | in the 'c.repo' function (which returns the base path jwplayer uses to get some assets) the url is rewritten once when the string literal with the original cdn's url is used, then again when the generated url string is munged for ssl |
| 16:07:21 | <thuban> | i guess there's no principled way to avoid this... |
| 16:25:44 | | LeGoupil joins |
| 16:31:59 | | jacobk_ quits [Ping timeout: 244 seconds] |
| 17:00:20 | | Guest69 quits [Client Quit] |
| 17:23:43 | | Video joins |
| 17:25:01 | | Ruthalas quits [Client Quit] |
| 17:25:21 | | Ruthalas (Ruthalas) joins |
| 17:30:22 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 17:34:04 | | h3ndr1k quits [Client Quit] |
| 17:42:50 | | h3ndr1k (h3ndr1k) joins |
| 17:51:43 | | Iki joins |
| 17:53:54 | | Wingy7 (Wingy) joins |
| 17:54:44 | | Wingy quits [Ping timeout: 250 seconds] |
| 17:54:44 | | Wingy7 is now known as Wingy |
| 17:55:00 | | AlsoHP_Archivist joins |
| 18:42:40 | | Mateon2 joins |
| 18:44:34 | | Mateon1 quits [Ping timeout: 250 seconds] |
| 18:44:34 | | Mateon2 is now known as Mateon1 |
| 18:52:53 | | CottonProphecy joins |
| 18:53:14 | | CottonProphecy is now authenticated as CottonProphecy |
| 19:27:25 | | qwertyasdfuiopghjkl joins |
| 19:56:35 | | CottonProphecy quits [Ping timeout: 244 seconds] |
| 19:57:03 | | GunDigger joins |
| 20:01:11 | | jacobk joins |
| 20:10:20 | | LeGoupil quits [Client Quit] |
| 20:13:45 | | CottonProphecy joins |
| 20:13:48 | | CottonProphecy is now authenticated as CottonProphecy |
| 20:17:36 | | G4te_Keep3r joins |
| 20:30:58 | | Video quits [Ping timeout: 252 seconds] |
| 20:38:21 | <wizards> | has anyone archived drivers, manuals, sdks and the like from canon's website? just figured i should ask before trying to archive it myself |
| 20:38:38 | <AK> | Got a link to the site and we can check? |
| 20:41:27 | <wizards> | example page for a specific camera: https://www.usa.canon.com/internet/portal/us/home/support/details/cameras/point-and-shoot-digital-cameras/slim-stylish-cameras/powershot-a2500/powershot-a2500?tab=drivers_downloads |
| 20:41:36 | <wizards> | and the place where i got that link from: https://www.usa.canon.com/internet/portal/us/home/support/drivers-downloads |
| 20:43:08 | <AK> | Hmm, I could give it a go in AB and see how it goes |
| 20:44:00 | <wizards> | might work for things like the reference photos, but the section that lists downloads uses js and probably would need manual work to scrape |
| 20:44:08 | <wizards> | i was writing a lua script to do exactly that |
| 20:45:21 | <AK> | Urgh, same for the manuals, it's all js |
| 20:45:42 | <AK> | It's all running through AB now anyway so we at least grab what we can |
| 20:48:23 | <thuban> | if you write your script to get the urls for the downloads, we can run that list through archivebot, too, so that at least the files will be in the wayback machine |
| 20:50:48 | <AK> | ^Forgot about that |
| 20:53:56 | | CottonProphecy quits [Ping timeout: 244 seconds] |
| 20:55:05 | <wizards> | in my list of urls, should i include the original urls or the ones they redirect to? since all of them are redirects |
| 20:55:10 | <wizards> | e.g. https://pdisp01.c-wss.com/gdl/WWUFORedirectTarget.do?id=MDMwMDAxMDYyODAx&cmp=ABR&lang=EN |
| 20:55:45 | <AK> | Original means we'll archive the redirect too |
| 20:55:46 | <thuban> | archivebot can follow redirects, so it's probably best to use the originals (since that way both will point to the file) |
| 20:55:57 | <thuban> | ^ what he said |
| 21:00:44 | <@JAA> | Uh |
| 21:01:03 | <@JAA> | Generally yes, but it depends. |
| 21:01:47 | <@JAA> | If all of the downloads behave like the above, i.e. the actual downloads are on a different host, it's fine. |
| 21:26:38 | | jacobk quits [Ping timeout: 250 seconds] |
| 22:25:42 | <h2ibot> | OrIdow6 edited Framasoft (+79, Correction on discovery source): https://wiki.archiveteam.org/?diff=47022&oldid=47014 |
| 22:35:39 | | AlsoHP_Archivist quits [Client Quit] |
| 22:38:28 | | HP_Archivist (HP_Archivist) joins |
| 23:05:34 | | Ruthalas8 (Ruthalas) joins |
| 23:05:57 | | Ruthalas quits [Read error: Connection reset by peer] |
| 23:05:57 | | Ruthalas8 is now known as Ruthalas |
| 23:51:36 | | Arcorann joins |
| 23:51:36 | | Arcorann is now authenticated as Arcorann |
| 23:51:43 | | BlueMaxima joins |
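
For readers following the 12:47 to 15:14 exchange between rewby and thuban, here is a minimal sketch of the extraction step they discuss. It is not rewby's actual extractor, which is not shown in the log. It uses the corrected regex thuban posted at 12:47:02, relies on capture groups to pull the URL out in one pass, and undoes a gzip `Content-Encoding` first, since rewby notes at 14:44:26 that WARC records hold raw HTTP responses. The function names `decode_body` and `extract_media_urls` are hypothetical, and the sketch assumes the payload and its `Content-Encoding` value have already been read out of the WARC record by some other tool. Brotli is skipped because it is not in the Python standard library.

```python
import gzip
import re

# Regex quoted from the log (thuban, 12:47:02).
# Group 1 is the key ("file" or "image"); group 2 is the quoted URL.
MEDIA_RE = re.compile(r'(file|image):\s*"([^"]*)",')


def decode_body(body, content_encoding=""):
    """Undo a gzip Content-Encoding if present, then decode to text.

    Hypothetical helper: assumes the caller already split the HTTP
    headers from the payload of the WARC response record.
    """
    if content_encoding.strip().lower() == "gzip":
        body = gzip.decompress(body)
    # Brotli ("br") is not handled here; it needs a third-party library.
    return body.decode("utf-8", errors="replace")


def extract_media_urls(text):
    """Return the unique URLs found in text, keyed by 'file' and 'image'."""
    urls = {"file": set(), "image": set()}
    for kind, url in MEDIA_RE.findall(text):
        urls[kind].add(url)
    return urls


if __name__ == "__main__":
    # The two strings thuban quotes at 14:29:45, gzip-compressed to mimic
    # a compressed HTTP response body.
    page = (
        b'file: "https://app4.rthk.hk/podcast/media/rthkmemory/b_v08.mp4",\n'
        b'image: "https://rthkmemorycms.rthk.hk/photo/media/thumbnail/108",\n'
    )
    found = extract_media_urls(decode_body(gzip.compress(page), "gzip"))
    for kind in ("file", "image"):
        for url in sorted(found[kind]):
            print(kind, url)
```

Collecting into sets removes the repeated hits thuban expects at 14:39:00, where each detail page appears under a base URL plus two language parameters. Keeping one set per key also gives the separate `file` and `image` lists thuban asked for at 13:03:13.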