| 00:01:44 | | dm4v_ joins |
| 00:01:44 | | dm4v quits [Read error: Connection reset by peer] |
| 00:01:56 | | dm4v_ is now known as dm4v |
| 00:01:58 | | dm4v is now authenticated as dm4v |
| 00:01:58 | | dm4v quits [Changing host] |
| 00:01:58 | | dm4v (dm4v) joins |
| 00:47:35 | | dasineura joins |
| 00:48:11 | <dasineura> | https://www.23andmeforums.com/discussions shutting down 9/6/21. site requires login to access, wondering if it can still be archived? |
| 00:48:28 | <dasineura> | " - On September 6, 2021, the forum content will be deleted and 23andmeforums.com will redirect to 23andme.com" from the website |
| 01:02:29 | | dm4v quits [Read error: Connection reset by peer] |
| 01:03:08 | | dm4v joins |
| 01:03:10 | | dm4v is now authenticated as dm4v |
| 01:03:10 | | dm4v quits [Changing host] |
| 01:03:10 | | dm4v (dm4v) joins |
| 01:08:02 | | lennier2 joins |
| 01:10:36 | | lennier1 quits [Ping timeout: 250 seconds] |
| 01:10:43 | | lennier2 is now known as lennier1 |
| 01:45:10 | <Jake> | alliew: mediatype:web would be perfect, and community data is the correct collection I believe, but it will go to the WARCzone automatically ( https://archive.org/details/warczone ). the Internet Archive isn't accepting WARCs for ingestion into the WBM from anyone but a very specific set of whitelisted users right now. |
| 01:46:22 | <alliew> | aight! archiveteam keyword yay or nay? |
| 02:05:17 | <Jake> | I think it's fine? |
| 02:18:46 | <@OrIdow6> | AFAIK it doesn't affect anything |
| 02:26:18 | | ThreeHM quits [Ping timeout: 244 seconds] |
| 02:28:33 | | ThreeHM (ThreeHeadedMonkey) joins |
| 02:47:29 | <alliew> | uploading ^^ |
| 02:48:02 | <alliew> | ultimate-guitar has had a "80% off on pro accounts" banner for like a year and i am Nervous about its financial situation because of it |
| 02:48:59 | <@OrIdow6> | dasineura: Generally the answer is no for login-only stuff like that |
| 02:49:50 | <dasineura> | yea was afraid of that |
| 02:54:22 | <@OrIdow6> | alliew: Is ultimate-guitar one of these things you uploaded? |
| 02:54:30 | | dasineura leaves |
| 02:54:34 | <@JAA> | Translation: it can be done and has happened before, but it will generally never go into the Wayback Machine. Also, making the data publicly available is only really an option if anyone can get access to the forums with a simple registration and no other barriers, and even then it's murky. |
| 02:54:39 | <@JAA> | Welp |
| 02:55:24 | <alliew> | OrIdow6, nope. i'm interested in archiving it |
| 02:55:43 | <Frogging101> | So my misterpoll won't go into WBM, then |
| 02:55:50 | <Frogging101> | right? |
| 02:55:56 | <alliew> | their paging doesn't go past 100 pages but they have a static artists list |
| 02:56:04 | <alliew> | so i think i can scrape all the tab urls through that |
| 02:57:12 | <Frogging101> | "making the data publicly available is only really an option if anyone can get access to the forums with a simple registration and no other barriers, and even then it's murky." |
| 02:57:14 | <Frogging101> | what does this mean |
| 02:57:25 | <Frogging101> | does it mean that if it wasn't public before, it's unethical to make it public? |
| 02:57:36 | <@OrIdow6> | For me it says the sale ends in 4 hours, is that just a gimmick? |
| 02:58:48 | <@OrIdow6> | Or is the end of a months long timer really in 4 hours? |
| 03:00:15 | <@OrIdow6> | Anyhow, if you think there is a serious risk of it going down or if it is small and/or simple and/or especially important (e.g. political) ArchiveTeam may be able to take it on itself |
| 03:00:18 | <@JAA> | Frogging101: Basically yeah. Only public stuff goes into the WBM. If there's a login wall that anyone can bypass simply by creating an account, it's effectively public, but it still generally wouldn't go into the WBM. If there's any barrier beyond that (say, subscription fee to access part of the site, like on SomethingAwful, or something like reputation/minimum number of posts/etc.), the data |
| 03:00:24 | <@JAA> | probably shouldn't be public at all. |
| 03:00:55 | | alliew quits [Ping timeout: 244 seconds] |
| 03:01:32 | <@OrIdow6> | Missed everything I said to them, of course |
| 03:01:48 | <@JAA> | As is tradition. |
| 03:02:34 | | alliew joins |
| 03:02:45 | <@OrIdow6> | Oh |
| 03:02:59 | <@JAA> | Frogging101: Doesn't mean such things shouldn't be archived, of course. It's the data sharing/publication that's the problematic side of things. |
| 03:03:07 | <@OrIdow6> | alliew: For me it says the sale ends in 4 hours, is that just a gimmick, or is the end of a months long timer really in 4 hours? Anyhow, if you think there is a serious risk of it going down or if it is small and/or simple and/or especially important (e.g. political) ArchiveTeam may be able to take it on itself |
| 03:03:25 | <alliew> | oh, they just keep changing the banner lol |
| 03:03:29 | <Frogging101> | If it's archived but nobody is allowed to see it then how useful is it? |
| 03:05:14 | <@OrIdow6> | Because "nobody is allowed to see it" is presumed to be relaxed in the future or under special circumstances |
| 03:05:30 | <Frogging101> | true enough |
| 03:05:47 | <alliew> | as for the importance, i find it pretty important as a resource |
| 03:05:59 | <alliew> | i'll probably have a url scrape going tomorrow |
| 03:06:10 | <alliew> | it should be around like, 1.7 million? |
| 03:06:16 | <alliew> | which with http2 is doable solo |
| 03:07:16 | <@JAA> | WARC doesn't support HTTP/2 though. |
| 03:08:18 | <alliew> | most HTTP/2 responses you get are HTTP/1.1 valid |
| 03:08:34 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+190, /* 2021 */ Add 23andMe forums): https://wiki.archiveteam.org/?diff=47092&oldid=47087 |
| 03:08:37 | <@JAA> | Er, no, because they say 'HTTP/2'. |
| 03:08:46 | <alliew> | i mean, yeah |
| 03:08:53 | <alliew> | i'm saying the actual content |
| 03:08:58 | <alliew> | fits the HTTP/1.1 spec |
| 03:09:05 | <alliew> | though the transport is different |
| 03:09:17 | <@JAA> | Yeah, but the status line doesn't, so it isn't allowed by the WARC specifications. |
| 03:09:26 | <nicolas17> | yeah so it can be converted, doesn't mean it's simply compatible |
| 03:09:31 | <@JAA> | (Please don't write non-compliant or faked WARCs.) |
| 03:09:31 | <alliew> | yep |
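
A minimal sketch of the status-line point made above, assuming (as stated in the discussion) that only `HTTP/1.1` status lines are acceptable in a WARC response record. The function name `is_http11_status_line` is hypothetical and not part of any WARC library.

```python
def is_http11_status_line(line: bytes) -> bool:
    """Return True if `line` looks like an HTTP/1.1 status line.

    Illustrative check only: it tests the version token and a
    three-digit status code, not the full HTTP grammar.
    """
    parts = line.rstrip(b"\r\n").split(b" ", 2)
    if len(parts) < 2:
        return False
    version, status = parts[0], parts[1]
    return version == b"HTTP/1.1" and len(status) == 3 and status.isdigit()
```

A status line such as `HTTP/2 200` fails this check even when the headers and body would otherwise be representable as HTTP/1.1, which is the incompatibility being described.
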
| 03:10:37 | <alliew> | this is a personal take, as someone who works in archival, so yknow, grains of salt |
| 03:11:42 | <alliew> | but personally, if it makes saving data where it's possible while keeping all the content fine, i'm pro? |
| 03:12:00 | <alliew> | like, if it's a convertible response, HTTP/2 allows for *really* fast scraping |
| 03:13:23 | <@JAA> | Let's get the performance argument out of the way: I'm regularly grabbing hundreds of responses per second from a single slowish machine over HTTP/1.1. (All you need is lots of parallel connections.) |
| 03:14:18 | <alliew> | i've found HTTP/2 multiplexing + multithreading significantly faster than conn pooling |
| 03:14:28 | <@JAA> | The issue is a different one. These WARC files are supposed to stay around for decades (or hopefully 'forever'). The further they stray from the official standard, the messier things get down the line. So it's best to stick to the standard. And if that's not possible, one should work on extending the standard, like we've done with the zstd compression for example. |
| 03:14:45 | <alliew> | oh yeah, agreed on that |
| 03:15:26 | <@JAA> | Since the standard specifically refers to HTTP/1.1 everywhere, that's all that's allowed, really. Technically, you can't even archive HTTP/0.9 or /1.0 I think (which I guess should be fixed as well). |
| 03:15:27 | | alliew quits [Read error: Connection reset by peer] |
| 03:15:52 | | alliew joins |
| 03:16:15 | <nicolas17> | alliew: what's the last message you saw? >.> |
| 03:16:43 | <alliew> | changing a status line for a compatible response, with proper flagging, personally to me keeps to both the standard in usability and preservation, again, only if it's straight up compatible *except for* the status line |
| 03:16:45 | <alliew> | uhh |
| 03:16:50 | <alliew> | my "oh yeah, agreed on that" |
| 03:17:03 | <@JAA> | 03:15:26 <@JAA> Since the standard specifically refers to HTTP/1.1 everywhere, that's all that's allowed, really. Technically, you can't even archive HTTP/0.9 or /1.0 I think (which I guess should be fixed as well). |
| 03:17:29 | <@JAA> | 'Proper flagging' isn't possible within the standard either. |
| 03:17:45 | <@JAA> | Really WARC should get HTTP/2 support (and /3 in the future). Everything else is insanity. |
| 03:17:58 | <alliew> | if your archive's staying around i sure goddamn hope your metadata is staying around |
| 03:18:13 | <alliew> | just for the sake of the intern in a century who's having to re-index it :p |
| 03:18:15 | <nicolas17> | uh how do I tell WBM to archive a page now? it already exists in WBM, it's just old |
| 03:18:45 | <nicolas17> | IIRC there's a "save page now" button that appears when a page *isn't* in WBM yet |
| 03:18:49 | <@JAA> | I meant a clean way of marking such records, e.g. in a WARC header. |
| 03:19:14 | <Frogging101> | JAA: what are the chances of my misterpoll grab getting mediatype: web |
| 03:19:24 | <@JAA> | But well, I still think such transformations should always be avoided entirely. |
| 03:19:41 | <alliew> | i guess it comes down to letter-of-the-law or not, right |
| 03:20:12 | <@OrIdow6> | nicolas17: web.archive.org/save |
| 03:20:34 | <@JAA> | WARCs are already a huge mess to correctly process due to ambiguities in the standard and everyone ignoring the standard (cf. the payload digest debacle). Let's not make it worse. |
| 03:21:00 | <alliew> | oh, by the way y'all |
| 03:21:33 | <@JAA> | As I said, performance-wise, meh, not a big deal. It's rare to find sites that happily handle thousands of requests per second from one IP anyway. |
| 03:21:52 | <Frogging101> | JAA: speaking of payload hash, this never got merged :( https://github.com/ArchiveTeam/wpull/pull/360 |
| 03:22:15 | <alliew> | if you're interested in a nightmare to archive, may i suggest the Hemeroteca Digital Brasileira? it's brazil's online newspaper archive, and has recently come down for days at a time. it's also ASPX and written so badly,, |
| 03:22:26 | <@JAA> | Frogging101: You can set the mediatype yourself. That's not an issue. It's also not the only factor for ingestion into the WBM though. |
| 03:22:41 | <alliew> | i've worked on scraping it before, and jesus christ i want to talk to whoever wrote this |
| 03:23:08 | <alliew> | just ask if they're ok. if ASPX has them bound to it through a curse that forces them to write in it. |
| 03:23:54 | <alliew> | JAA: yeah, i get your position on that |
| 03:23:54 | <@JAA> | Frogging101: Ugh yeah, right, that one. Reminds me of the warcio bugs I recently discovered. |
| 03:27:27 | <@JAA> | alliew: Oh, that clusterfuck. I remember looking at that briefly when the Museu Nacional went up in flames, I think. |
| 03:27:42 | <alliew> | god, rip museu nacional |
| 03:27:56 | <alliew> | being an archivist in this country is Pain |
| 03:28:08 | <alliew> | our film archive caught fire too recently |
| 03:28:36 | <@JAA> | Oof. I missed that. |
| 03:29:24 | <alliew> | didn't get as much stuff as the museu nacional fire but was still painful |
| 03:30:10 | <nicolas17> | yikes |
| 03:34:16 | <alliew> | oh also doing a wget --mirror warc grab of a-infos, found it on the fire drill page |
| 03:43:27 | <alliew> | anyways, gotta sleep and i haven't set my IRC bouncer back up yet. have a g'night |
| 03:43:31 | | alliew quits [Client Quit] |
| 03:45:41 | | qw3rty_ joins |
| 03:49:12 | | qw3rty__ quits [Ping timeout: 250 seconds] |
| 04:32:25 | <@OrIdow6> | Anyone know of any Google Docs filetypes that are neither documents, spreadsheets, presentations, sites, nor regular (bytestring) files? |
| 04:33:28 | <@OrIdow6> | And technically folders |
| 04:34:55 | <@OrIdow6> | Apparently there's also drawings, tables, and forms, don't know if those are downloadable or not |
| 04:35:57 | <@OrIdow6> | Whatever the case, I intend to write this conservatively enough that if an unexpected type comes up, it will fail and make that known |
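
A rough sketch of the "fail loudly on an unexpected type" approach described above. The type labels and handler names are hypothetical placeholders; the actual grab script may identify types quite differently.

```python
class UnexpectedTypeError(Exception):
    """Raised when an item has a type the grab was not written to handle."""


# Hypothetical labels for the types named in the discussion.
HANDLERS = {
    "document": "export_document",
    "spreadsheet": "export_spreadsheet",
    "presentation": "export_presentation",
    "site": "fetch_site",
    "file": "download_bytes",
    "folder": "list_children",
}


def pick_handler(item_type: str) -> str:
    """Return the handler name for a known type, or raise for anything else."""
    try:
        return HANDLERS[item_type]
    except KeyError:
        # No silent fallback: an unknown type aborts so that it gets noticed.
        raise UnexpectedTypeError(f"unhandled item type: {item_type!r}") from None
```

The design choice here is that there is no default branch, so a type such as a drawing or form stops the item instead of being archived incorrectly.
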
| 05:06:59 | | benjins quits [Ping timeout: 244 seconds] |
| 05:22:13 | | nicolas17 quits [Ping timeout: 252 seconds] |
| 05:36:28 | | BlueMaxima_ joins |
| 05:40:22 | | BlueMaxima quits [Ping timeout: 252 seconds] |
| 06:13:07 | | BlueMaxima_ quits [Ping timeout: 244 seconds] |
| 10:25:05 | | benjins joins |
| 10:25:18 | | benjins is now authenticated as benjins |
| 11:44:35 | | Iki joins |
| 11:55:27 | | Iki1 joins |
| 11:56:29 | | gurubob joins |
| 11:59:17 | | Iki quits [Ping timeout: 244 seconds] |
| 12:01:01 | | gurubob quits [Remote host closed the connection] |
| 12:42:30 | | knecht420 quits [Quit: The Lounge - https://thelounge.chat] |
| 12:42:45 | | knecht420 (knecht420) joins |
| 12:54:56 | | nertzy__ joins |
| 13:11:54 | | nertzy__ quits [Client Quit] |
| 13:26:47 | | Megame quits [Client Quit] |
| 14:34:07 | | qwertyasdfuiopghjkl joins |
| 14:54:45 | | jonst123 joins |
| 15:05:17 | | Arcorann quits [Ping timeout: 244 seconds] |
| 15:13:28 | | hexa- quits [Quit: WeeChat 3.1] |
| 15:14:56 | | hexa- (hexa-) joins |
| 15:15:07 | <@OrIdow6> | arkiver: Any updates on the review of google-drive-grab? |
| 16:03:54 | | Gereon6 (Gereon) joins |
| 16:16:00 | <@JAA> | For visibility: ArchiveBot is currently broken and cannot accept new jobs for the time being. |
| 16:26:54 | | alliew (alliew) joins |
| 16:37:54 | | nicolas17 joins |
| 17:03:39 | <@arkiver> | OrIdow6: not yet, in a few hours i hope |
| 17:05:32 | <@arkiver> | checking it now |
| 17:06:52 | <@arkiver> | nice one on the two checks in checkip |
| 17:08:30 | <@arkiver> | OrIdow6: do you have items? |
| 17:09:33 | <@arkiver> | for a size estimate, i don't think we have to go through all folders first |
| 17:10:13 | <@arkiver> | we simply take a sample out of the complete pool of folders and files, and check the total size after archiving those |
| 17:11:01 | <@arkiver> | (assuming not many files in the folders are listed in our raw list outside of the folders) |
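
The sampling idea above could be sketched roughly as follows, assuming the archived size of each sampled item is known in bytes. The function names and the simple mean-times-count extrapolation are illustrative assumptions, not the project's actual method.

```python
import random
from statistics import mean


def draw_sample(pool, k, seed=None):
    """Pick k items uniformly at random from the full pool of folders and files."""
    rng = random.Random(seed)
    return rng.sample(list(pool), k)


def estimate_total_bytes(sample_sizes, pool_count):
    """Extrapolate total size as (mean archived size of the sample) * (pool size)."""
    if not sample_sizes:
        raise ValueError("need at least one sampled item")
    return mean(sample_sizes) * pool_count
```

As the caveat above notes, the estimate is inflated if many files appear both inside a sampled folder and as separate entries in the raw list.
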
| 17:16:51 | <@arkiver> | OrIdow6: why are you adding an ignore for protocol_and_domain_and_port? |
| 17:17:00 | <@arkiver> | line 91-94 |
| 17:20:06 | <@OrIdow6> | arkiver: I have test items, I do not have item list yet |
| 17:23:27 | <@OrIdow6> | You're right it's wrong, I think that line in add_ignore may be missing a $ at the end |
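
To illustrate why a missing `$` matters in an ignore pattern, here is a small sketch using Python regular expressions. The patterns are hypothetical stand-ins for a protocol, domain, and port match and are not taken from the script under review.

```python
import re

# Without the end anchor, the pattern matches the prefix of any URL on the host.
UNANCHORED = re.compile(r"^https?://[^/]+")
# With the end anchor, it matches only the bare protocol://domain[:port] string.
ANCHORED = re.compile(r"^https?://[^/]+$")


def is_ignored(pattern, url):
    """Return True if the ignore pattern matches the URL."""
    return pattern.search(url) is not None
```

With the unanchored form, `https://drive.google.com/file/d/abc` would be ignored along with everything else on that host; the anchored form ignores only `https://drive.google.com` itself.
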
| 17:28:30 | <@OrIdow6> | I've also noticed I'm not checking status_code in "if req_callbacks[url] ~= nil then" |
| 17:29:55 | <@arkiver> | could be, still checking it |
| 17:31:08 | <@arkiver> | OrIdow6: why do we return false in allowed on start_urls_inverted[url] ? |
| 17:33:28 | | LeGoupil joins |
| 17:34:40 | <@OrIdow6> | arkiver: Because if start_urls_inverted[url] is true, url is in start_urls, which I'm using to set new items |
| 17:34:54 | <@OrIdow6> | So there's a risk of setting a new item if start_urls_inverted[url] |
| 17:37:38 | <@arkiver> | right, yeah i see allowed is not used to check if a URL is in-scope in httploop_result |
| 17:37:43 | <@arkiver> | (because all URLs are i believe) |
| 17:43:24 | <@OrIdow6> | Yes, I'm trying to have more detailed control over the retry process |
| 17:59:00 | <@OrIdow6> | And I will say I have not bothered with previews, there are something like 8 different types I'd have to write and they are of marginal benefit |
| 18:01:30 | | qwertyasdfuiopghjkl31 joins |
| 18:01:59 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 18:02:22 | | qwertyasdfuiopghjkl31 is now known as qwertyasdfuiopghjkl |
| 18:41:37 | | alliew quits [Client Quit] |
| 18:46:23 | <@arkiver> | alright |
| 19:02:53 | <@arkiver> | OrIdow6: what are your test items? |
| 19:03:05 | <@arkiver> | nice one on the POST requests :) |
| 19:09:43 | <systwi> | JAA: What broke with ArchiveBot? |
| 19:10:09 | <Jake> | some Redis issue I believe. |
| 19:21:39 | | lennier1 quits [Client Quit] |
| 19:21:49 | <spirit> | maybe use a webscale database next time? |
| 19:24:42 | <@OrIdow6> | arkiver: folder:1r8I5hpSPCf_9JWECwa6c4E4tQZELd3cx folder:1oCMgJeBc55NuEasPcgwjx2FuPdQd8neu folder:0B7z5EDsKyEsGfkEybGh2Y0tuc0dpMTVCbDZ4N1RXTGZMbnhwWEZqcnJmMzVYcy10SEplSlE |
| 19:25:37 | <@OrIdow6> | As well as subfolders etc. if you want more |
| 19:32:00 | | onetruth joins |
| 19:37:32 | <pcr> | OrIdow6: You might want to add in something for google colab notebooks, those are stored in drive |
| 19:38:40 | <pcr> | Sample item: https://colab.research.google.com/drive/11z58bl3meSzo6kFqkahMa35G5jmh2Wgt |
| 19:42:26 | | lennier1 (lennier1) joins |
| 19:45:58 | <pcr> | You might also need handling for Jamboard |
| 19:50:41 | <@OrIdow6> | Thank you pcr, do you have an example of the second? |
| 19:55:26 | | wessel1512 is now authenticated as wessel1512 |
| 20:04:57 | <pcr> | https://jamboard.google.com/d/1uJgeZf69HWVLATuEKjrAHCyDW1ADPmvDPJkjrHs6-j8/edit |
| 20:16:14 | <@OrIdow6> | Thanks |
| 20:16:33 | <@OrIdow6> | FYI it looks like colab will not need special handling, but jamboard will need a bit |
| 20:22:09 | <@OrIdow6> | Able to download what I assume is a canonical JSON vs export as PDF |
| 20:25:59 | <@OrIdow6> | If I have time may try to work on the web interface directly, but I think it's unlikely it will work |
| 20:37:11 | | dewdrop quits [Quit: The Lounge - https://thelounge.chat] |
| 20:39:50 | <@JAA> | systwi: Redis broke. Check your logs in #archivebot for the details. |
| 20:40:35 | | onetruth quits [Read error: Connection reset by peer] |
| 20:42:34 | | dewdrop (dewdrop) joins |
| 20:51:08 | <@arkiver> | lets get a google drive channel going! |
| 20:51:09 | <@arkiver> | any ideas? |
| 20:51:34 | <@JAA> | googlecrash |
| 20:51:46 | <@arkiver> | nice |
| 20:53:46 | <Jake> | i like it :) |
| 20:54:22 | <@arkiver> | #googlecrash it is |
| 21:28:17 | | LeGoupil quits [Client Quit] |
| 21:36:14 | | Stiletto quits [Read error: Connection reset by peer] |
| 21:36:28 | | Stiletto joins |
| 21:40:01 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 21:48:16 | | qwertyasdfuiopghjkl joins |
| 22:22:34 | | Megame (Megame) joins |
| 23:03:43 | | phiresky quits [Ping timeout: 252 seconds] |
| 23:06:05 | | phiresky joins |
| 23:31:08 | | Arcorann (Arcorann) joins |
| 23:35:45 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 23:46:36 | | dserve quits [Ping timeout: 244 seconds] |
| 23:49:42 | | BlueMaxima joins |