00:05:43 | | Arcorann (Arcorann) joins |
00:14:49 | | DogsRNice_ joins |
00:17:50 | | DogsRNice quits [Ping timeout: 240 seconds] |
00:40:02 | | jasons (jasons) joins |
00:46:36 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
01:02:45 | | DogsRNice__ joins |
01:04:24 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
01:06:47 | | DogsRNice_ quits [Ping timeout: 272 seconds] |
01:19:28 | | Mateon2 joins |
01:21:21 | | Mateon1 quits [Ping timeout: 272 seconds] |
01:21:21 | | Mateon2 is now known as Mateon1 |
01:35:55 | | jasons quits [Ping timeout: 272 seconds] |
01:48:54 | <h2ibot> | Pokechu22 edited Deathwatch (+122, /* 2024 */ virusradar.com (AB job in progress…): https://wiki.archiveteam.org/?diff=51577&oldid=51576 |
02:39:18 | | jasons (jasons) joins |
02:45:35 | | Naruyoko5 quits [Ping timeout: 272 seconds] |
02:49:23 | <@JAA> | The SAP Q&A/blog migration is complete. It looks like answers.sap.com started redirecting to community.sap.com at around 2024-01-25 21:35 (or at least that's when my 403s stopped and I got 301s instead). |
02:50:41 | <@JAA> | They have redirects in place, but not for all URLs that were valid previously. |
02:53:36 | <fireonlive> | :/ |
03:08:14 | | Naruyoko joins |
03:18:15 | | fuzzy8021 quits [Read error: Connection reset by peer] |
03:20:32 | | fuzzy8021 (fuzzy8021) joins |
03:21:10 | | fuzzy8021 quits [Read error: Connection reset by peer] |
03:21:36 | | fuzzy8021 (fuzzy8021) joins |
03:37:20 | | jasons quits [Ping timeout: 240 seconds] |
04:40:46 | | jasons (jasons) joins |
04:59:36 | | DogsRNice__ quits [Read error: Connection reset by peer] |
05:16:45 | | BlueMaxima quits [Client Quit] |
05:37:20 | | jasons quits [Ping timeout: 240 seconds] |
06:07:50 | <h2ibot> | Pokechu22 edited Rumble (+743, a bit of info on embeds and regular videos): https://wiki.archiveteam.org/?diff=51578&oldid=50648 |
06:38:26 | | cas joins |
06:39:44 | <cas> | https://www.reddit.com/r/DanmeiNovels/comments/19eld53/bilibili_comics_international_app_is_shutting/ I got word that bilibili comics is shutting down on Feb 29 2024 |
06:40:04 | <cas> | I wonder if AT can afford to save and archive its content |
06:40:48 | | jasons (jasons) joins |
06:42:45 | <cas> | https://www.reddit.com/r/Piracy/comments/19d8py9/a_bulgarian_videosharing_platform_is_going_to/ there's also this bulgarian videosharing platform that's going to delete everything it has in a month too. Here's the link to it. https://www.vbox7.com/ I recall I sent messages about this already tho |
06:43:31 | <cas> | Ok I see there's an article about the site, disregard my message about vbox7 |
07:25:24 | | cas quits [Remote host closed the connection] |
07:27:44 | | parfait (kdqep) joins |
07:36:50 | | jasons quits [Ping timeout: 240 seconds] |
07:42:09 | <h2ibot> | Pokechu22 edited Rumble (+64): https://wiki.archiveteam.org/?diff=51579&oldid=51578 |
07:56:33 | | Ruthalas59 quits [Ping timeout: 272 seconds] |
08:12:24 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
08:40:25 | | jasons (jasons) joins |
09:04:29 | | Ruthalas59 (Ruthalas) joins |
09:35:50 | | jasons quits [Ping timeout: 240 seconds] |
10:00:04 | | Bleo18260 quits [Client Quit] |
10:01:22 | | Bleo18260 joins |
10:39:20 | | jasons (jasons) joins |
10:42:23 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
11:06:50 | | programmerq quits [Ping timeout: 240 seconds] |
11:41:23 | | jasons quits [Ping timeout: 272 seconds] |
11:50:47 | | driib quits [Quit: The Lounge - https://thelounge.chat] |
11:53:48 | | murmur quits [Read error: Connection reset by peer] |
11:53:50 | | murmur joins |
12:02:57 | | programmerq (programmerq) joins |
12:09:48 | | driib (driib) joins |
12:21:49 | | driib quits [Client Quit] |
12:22:52 | | driib (driib) joins |
12:44:07 | | jasons (jasons) joins |
13:09:50 | | Arcorann quits [Ping timeout: 240 seconds] |
13:12:56 | | h3ndr1k quits [Client Quit] |
13:14:56 | | h3ndr1k (h3ndr1k) joins |
13:15:18 | <h2ibot> | Bzc6p edited Vbox7 (+5, Status: this is more like a special case): https://wiki.archiveteam.org/?diff=51580&oldid=51564 |
13:23:07 | | Megame (Megame) joins |
13:44:53 | | jasons quits [Ping timeout: 272 seconds] |
14:21:12 | | Inti83 joins |
14:21:13 | <eggdrop> | [tell] Inti83: [2023-12-06T07:04:23Z] <thuban> all of the sites on the argentina wiki page have been submitted to archivebot; you can monitor running jobs at http://archivebot.com/ and retrieve finished ones from https://archive.fart.website/archivebot/viewer/ |
14:21:14 | <eggdrop> | [tell] Inti83: [2023-12-06T07:04:29Z] <thuban> note that a job succeeding does not necessarily mean the site was adequately captured (if eg there is heavy use of javascript) |
14:22:26 | <Inti83> | Hi there I am back just to let you know Educ.ar was taken down. I added the youtube channel to the wiki, just in case they take that down as well. Unfortunately the archived site seems to be broken, probably because of heavy use of JS :/ |
14:23:43 | <Inti83> | We downloaded quite a few sites with grab-site, but downloaders are a bit worried about the metadata giving away their identities. Do you know if there is a way of parsing warcs to change, for instance, the home directory? Without corrupting the integrity of the warc? |
14:24:15 | | Shjosan quits [Client Quit] |
14:24:31 | | Shjosan (Shjosan) joins |
14:33:25 | <thuban> | Inti83: hello! looks like your last edit is still in the moderation queue, but if you join #down-the-tube you can queue the youtube channel yourself |
14:35:23 | <thuban> | as for your question, are you concerned specifically about the directory paths in the warcinfo? |
14:44:04 | <yzqzss> | Hi, |
14:44:04 | <yzqzss> | We're archiving a Chinese painting app called 画吧 (huabar). It will be shut down on 2024-02-08. |
14:44:19 | <yzqzss> | It has a total of ~19,000,000 valid painting IDs (noteid). The project files and images of the paintings add up to 10-13 TiB. We have downloaded 70% of them, and the rest will be done in 3 days. |
14:44:38 | <yzqzss> | We are considering uploading the archive to IA. This is technically possible, but we are not sure whether 10+ TiB of data is acceptable for IA. Any experience/suggestions? |
14:45:14 | <yzqzss> | https://wiki.saveweb.org/画吧 |
14:45:14 | <yzqzss> | https://wiki.saveweb.org/en:画吧 |
14:48:06 | | jasons (jasons) joins |
14:57:25 | | inti8365 joins |
14:57:35 | | inti8365 quits [Remote host closed the connection] |
14:58:24 | <Inti83> | @thub |
14:58:45 | <Inti83> | thuban: hi, sorry am not used to irc |
14:59:08 | <Inti83> | Yes, I traced down the date of the warc the bot created and then looked up that date |
14:59:14 | <Inti83> | in archive.org |
14:59:19 | <Inti83> | I'm not sure if that's how it works |
14:59:28 | <Inti83> | I didn't download the warcs |
14:59:32 | | Iki1 joins |
15:00:00 | <Inti83> | But most of the front-page links are broken. I should look into the actual warc that the bot created, perhaps try it out on replay.page |
15:01:50 | <yzqzss> | edit: it's a painting app from China, not "Chinese painting" app |
15:03:25 | | Iki quits [Ping timeout: 272 seconds] |
15:04:57 | <thuban> | Inti83: the job generated a lot of data (https://archive.fart.website/archivebot/viewer/job/2023120605514614pkg), so hopefully the relevant pages have been saved even if the front-page links don't work; you may be able to find them through the menus |
15:06:16 | <Inti83> | thuban: cool, yeah I saw all the files, will check thanks! |
15:06:20 | | Megame quits [Ping timeout: 240 seconds] |
15:07:03 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
15:10:25 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
15:10:26 | | qwertyasdfuiopghjkl quits [Excess Flood] |
15:11:45 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
15:18:45 | <h2ibot> | Switchnode edited Deathwatch (+412, /* 2024 */ add huabar): https://wiki.archiveteam.org/?diff=51581&oldid=51577 |
15:26:00 | <thuban> | yzqzss: nice work! your question is probably more appropriate for #internetarchive, but my _guess_ is that if you contact them in advance it will be fine |
15:26:46 | <thuban> | (a lot of archiveteam projects are around that size; some are much bigger) |
15:31:02 | | Inti83 quits [Ping timeout: 265 seconds] |
15:47:45 | | jasons quits [Ping timeout: 272 seconds] |
15:55:14 | <yzqzss> | thuban: OK, I'll contact info@IA. |
16:03:25 | <@JAA> | yzqzss: Talk to arkiver about it! :-) |
16:05:34 | <@JAA> | !tell Inti83 There is no good tooling for editing WARCs (warcio might look like it, but stay away from that!), and attempting to do so is generally strongly recommended against precisely because it's so easy to corrupt something. Changing the warcinfo record would require rewriting the entire WARC(s). |
16:05:35 | <eggdrop> | [tell] ok, I'll tell Inti83 when they join next |
16:06:47 | <thuban> | JAA: so you're telling me i shouldn't pop it open in my text editor? :) |
16:07:38 | <@JAA> | thuban: Might still be a better option than warcio. :-) |
16:08:25 | <thuban> | probably is, considering the header-rewriting stuff |
16:08:38 | <yzqzss> | JAA: yeap, I PMed arkiver days ago, but he was traveling (https://irclogs.archivete.am/archiveteam-bs/2024-01-26#lf158747a) and I didn't receive a reply. lol. |
16:08:50 | <@JAA> | Assuming your editor doesn't try to convert to UTF-8 or similar things, at least, yeah. |
16:09:25 | <@JAA> | yzqzss: Right, yeah, just try again. :-) You'll have more success than with info@ in my experience. |
16:15:51 | <thuban> | in all seriousness i did almost suggest that, with appropriate limitations and caveats. but i didn't want to jump the gun when i wasn't entirely clear on what was being asked... |
16:16:15 | <thuban> | (did you know about http://purl.org/dc/terms/provenance ? :o) |
16:45:10 | <@arkiver> | yzqzss: i missed your message indeed |
16:50:44 | | jasons (jasons) joins |
16:53:32 | <@arkiver> | (it was through discord) |
16:54:06 | <@arkiver> | just as a note to everyone - the most reliable way to reach me is at arkiver@protonmail.com , or IRC. if I don't reply on something like Discord, please send me an email instead |
17:00:31 | | Inti83 joins |
17:00:32 | <eggdrop> | [tell] Inti83: [2024-01-27T16:05:34Z] <JAA> There is no good tooling for editing WARCs (warcio might look like it, but stay away from that!), and attempting to do so is generally strongly recommended against precisely because it's so easy to corrupt something. Changing the warcinfo record would require rewriting the entire WARC(s). |
17:11:15 | | Megame (Megame) joins |
17:13:26 | | bladem quits [Quit: Leaving] |
17:38:11 | | EmeraldSnorlax|m is now known as rain|m |
17:42:43 | <Inti83> | eggdrop: if a site is now archived in archive.org, how can we use the warcs? Can we download them to use them or process them? Like with warc2html if needed? Is there any way to navigate the warc files? |
17:43:02 | | Shjosan quits [Client Quit] |
17:43:32 | | Shjosan (Shjosan) joins |
17:50:21 | <Inti83> | Another question: if the bot creates this warc, www.educ.ar-inf-20231206-055146-14pkg-00000.warc.gz, does this correspond to an entry for the 6th of December at 05:51 in archive.org? |
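A minimal sketch of reading the timestamp out of that filename. The layout (host, "inf", YYYYMMDD, HHMMSS, job suffix, sequence number) is an assumption inferred from this single example and from the job ID 2023120605514614pkg in the viewer links; treating the stamp as UTC is also an assumption. `warc_name_timestamp` is a hypothetical helper, not part of ArchiveBot. The stamp appears to identify the job rather than any single capture, so individual pages in the Wayback Machine would presumably carry their own capture times and need not match it exactly.

```python
from datetime import datetime, timezone


def warc_name_timestamp(filename: str) -> datetime:
    """Extract the YYYYMMDD-HHMMSS stamp from an ArchiveBot-style WARC name.

    Hypothetical helper: the filename layout is inferred from one example.
    """
    parts = filename.split("-")
    # Look for an 8-digit date field immediately followed by a 6-digit time field.
    for date_part, time_part in zip(parts, parts[1:]):
        if (
            len(date_part) == 8
            and len(time_part) == 6
            and (date_part + time_part).isdigit()
        ):
            # Assumption: the stamp is in UTC.
            parsed = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"no timestamp found in {filename!r}")


name = "www.educ.ar-inf-20231206-055146-14pkg-00000.warc.gz"
stamp = warc_name_timestamp(name)
print(stamp.isoformat())  # 2023-12-06T05:51:46+00:00
```

Concatenating the date, the time, and the next field ("14pkg") gives the job ID used by the viewer, which is the more dependable way to find every file belonging to the job.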
17:52:09 | <nulldata> | Inti83 - eggdrop is a bot relaying a message from JAA made while you were gone |
17:52:38 | <Inti83> | Ahh thanks for letting me know nulldata |
18:04:09 | <@JAA> | thuban: Yeah, there are lots of issues with the approach (e.g. also line endings), so I would never recommend it, but it can work. Compression is a different beast though. |
18:05:18 | <thuban> | mmm |
18:06:06 | <@JAA> | Inti83: You can download ArchiveBot WARCs, sure. I'm not familiar with the quirks of the tooling to dump the contents into static files or similar, except that it's a bit of a pain. Personally, when I've needed WARC playback locally, I've used pywb in the past, which worked well enough (although it inherits warcio's problems, but if you don't care too much about header accuracy, it's acceptable for |
18:06:12 | <@JAA> | playback). |
18:06:57 | <@JAA> | AB job data is spread over items in the collection, but you can find all the files for a job through the AB viewer. |
18:07:15 | <@JAA> | In this case: https://archive.fart.website/archivebot/viewer/job/2023120605514614pkg |
18:11:27 | <thuban> | the quickest way to play back a warc is https://replayweb.page/; i believe unar/"The Unarchiver" can extract files if you need to do that for some reason |
18:14:07 | <@JAA> | Will it rewrite the href and src attributes to match the structure after extraction though? |
18:15:57 | | Megame quits [Ping timeout: 272 seconds] |
18:17:26 | | Megame (Megame) joins |
18:18:10 | <thuban> | no, it extracts files. |
18:18:22 | | Megame quits [Remote host closed the connection] |
18:19:39 | | Megame (Megame) joins |
18:21:37 | <@JAA> | Yeah, as expected. |
18:21:54 | <@JAA> | So it's virtually useless for playback unless it's all media or similar. |
18:22:12 | <thuban> | that's why i didn't recommend it for playback! |
18:22:36 | <nicolas17> | Inti83: WARCs that were archived using archivebot are available on wayback machine, so it's rare that you need to download them and extract them yourself |
18:22:45 | <@JAA> | The entire WARC ecosystem is so awkward to work with... |
18:25:36 | | Shjosan quits [Client Quit] |
18:25:52 | | Shjosan (Shjosan) joins |
18:29:14 | | Inti83 quits [Remote host closed the connection] |
18:34:12 | | katia_ (katia) joins |
18:34:49 | | katia_ quits [Remote host closed the connection] |
18:45:12 | | katia_ (katia) joins |
18:46:22 | | Megame quits [Remote host closed the connection] |
18:46:44 | | Megame (Megame) joins |
19:01:07 | | decky_e_ quits [Read error: Connection reset by peer] |
19:01:10 | | katia_ quits [Remote host closed the connection] |
19:01:27 | | decky_e_ joins |
19:02:34 | | decky joins |
19:03:22 | | nyany_ quits [Quit: (516): and then you went into taco bell without pants...and surprisingly you weren't the only one there without pants] |
19:03:32 | | nyany (nyany) joins |
19:05:50 | | decky_e_ quits [Ping timeout: 240 seconds] |
19:09:25 | | itachi1706 quits [Quit: Bye :P] |
19:11:15 | | itachi1706 (itachi1706) joins |
19:26:34 | | Wohlstand (Wohlstand) joins |
19:35:42 | | Megame quits [Client Quit] |
19:43:36 | | alpine joins |
19:44:50 | | jasons quits [Ping timeout: 240 seconds] |
19:55:39 | | alpine quits [Remote host closed the connection] |
19:59:09 | | Shrinks99 joins |
19:59:34 | <Shrinks99> | Gah, apologies, maybe there's a few other links on the wiki that should be updated to this channel :P |
19:59:44 | | Alyssa joins |
19:59:50 | <Alyssa> | BS = bullshit? |
20:00:17 | <Alyssa> | Can we substitute "probabilistic epistemology"? |
20:01:54 | <thuban> | we've discussed changing the channel names. but perhaps it _would_ make sense to change that front page link... |
20:02:14 | <Alyssa> | Hi, thuban :) |
20:02:28 | <Alyssa> | I wanna help you guys. |
20:02:35 | <Alyssa> | I have a few ideas. |
20:02:54 | <Alyssa> | Wanna wget -mb some stuff with me? |
20:03:28 | <Alyssa> | Btw, I know this was started in January 2009 by SketchCow. |
20:03:48 | <Alyssa> | I think we can really make the Internet Archive famous |
20:04:01 | <Alyssa> | It's the only reliable way to establish intellectual property. |
20:04:27 | <@JAA> | Either 'bullshit' or 'bikeshed', depending on who you ask. |
20:04:40 | <Alyssa> | JAA: Hi *-* |
20:04:54 | <@JAA> | It used to be the offtopic channel, nowadays it's the archival discussion channel, and -ot is for offtopic. |
20:05:03 | <Alyssa> | Ok. |
20:05:33 | <Alyssa> | Basically, I think we should archive music that's in danger of being deleted off of youtube. |
20:05:43 | <Alyssa> | The problem, of course, is copyright law |
20:05:58 | <Alyssa> | And despite my autodidactic BA in legalese |
20:06:04 | | JC|m leaves |
20:06:06 | <Alyssa> | I can't really find the right way to do it. |
20:06:11 | <thuban> | Shrinks99: we don't have any such rules that i'm aware of, although perhaps people have Opinions. what did you want to update? |
20:06:43 | <Shrinks99> | Better description for ReplayWebpage, docs link, contributor count update |
20:07:17 | <thuban> | go right ahead, imho |
20:07:21 | <Shrinks99> | Would also want to add ArchiveWeb.page, our extension for interactive archiving of pages in Chrome |
20:07:22 | <Alyssa> | https://www.youtube.com/watch?v=5EZRA-KQx58 |
20:08:18 | <@JAA> | Yeah, I'm curious if that can actually comply with the WARC spec, but I haven't taken a close look at it yet. |
20:08:30 | <Shrinks99> | Both of the above support WARCs, Would maybe add an entry for Browsertrix Crawler — though I'm not sure if it actually supports WARC files directly (you can always extract them from the WACZs) |
20:08:33 | <@JAA> | If it uses the APIs rather than MITM proxying, I don't think it can. |
20:09:11 | <nicolas17> | Alyssa: archive.org is not immune to the DMCA, and big recording labels like UMG *will* send takedown requests |
20:10:03 | <@JAA> | Feel free to add it though. Probably leave the 'recommendation' column as a question mark unless you're familiar with our position on WARC spec compliance etc. |
20:10:15 | <Shrinks99> | AFAIK ArchiveWeb.page is spec compliant? But as our UX designer admittedly that's not the area I have the most knowledge about heh |
20:10:22 | <Shrinks99> | Yeah, will leave the recommendation column to you |
20:10:46 | <@JAA> | The big thing is that WARC needs to preserve the data exactly as it is sent by the server. |
20:11:12 | <@JAA> | If you use browser APIs to get the response headers and then combine that back into lines, that may not be what the server sent. |
20:11:17 | <Alyssa> | nicolas17: Good to know ;) |
20:11:19 | <@JAA> | So anything that does that does not comply with the spec. |
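A small illustration of the point above, not a model of any real browser API. `parse_like_an_api` is a hypothetical stand-in that does what header APIs commonly do (lowercase names, trim whitespace, merge repeated fields); the raw bytes are made up for the example. Rebuilding header lines from that parsed form does not reproduce the bytes the server sent.

```python
# Made-up raw response head, exactly as a server might have sent it.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"content-TYPE:   text/html\r\n"
    b"Set-Cookie: a=1\r\n"
    b"Set-Cookie: b=2\r\n"
    b"\r\n"
)


def parse_like_an_api(raw: bytes) -> tuple[str, dict[str, str]]:
    """Hypothetical header API: lowercases names, trims values, merges duplicates."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("ascii")
    status_line, *lines = head.split("\r\n")
    headers: dict[str, str] = {}
    for line in lines:
        name, _, value = line.partition(":")
        key = name.strip().lower()
        value = value.strip()
        # Repeated fields are folded into one comma-separated value.
        headers[key] = f"{headers[key]}, {value}" if key in headers else value
    return status_line, headers


def rebuild(status_line: str, headers: dict[str, str]) -> bytes:
    """Combine the parsed headers back into lines, as a recorder might."""
    out = [status_line] + [f"{k}: {v}" for k, v in headers.items()]
    return ("\r\n".join(out) + "\r\n\r\n").encode("ascii")


status_line, headers = parse_like_an_api(RAW)
rebuilt = rebuild(status_line, headers)
print(rebuilt == RAW)  # False: case, spacing and the repeated field were lost
```

The information lost here (original capitalization, whitespace, and whether a field was repeated) cannot be recovered from the parsed form, which is why a recorder that only sees parsed headers cannot store the response exactly as sent.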
20:11:33 | <Shrinks99> | Yeah, unsure |
20:11:51 | <@JAA> | warcio has similar problems, as you may have seen mentioned on that page. |
20:12:30 | <Shrinks99> | I'm guessing WARCIT is also out of the question then haha |
20:13:01 | <@JAA> | Oh, that thing, yes, definitely. |
20:13:30 | <@JAA> | But putting it on the page is still useful so we can put a big red warning next to it. :-) |
20:13:47 | <Shrinks99> | Fair enough ;P |
20:14:14 | <Shrinks99> | FWIW I wouldn't recommend it RN either, I don't think it even writes WARC records correctly? |
20:17:45 | <h2ibot> | Inti83 edited Argentina (+175, Add Educ.ar youtube channel): https://wiki.archiveteam.org/?diff=51587&oldid=51435 |
20:17:46 | <h2ibot> | CreaZyp154 edited URLTeam/Warrior (+249, Added note for is.gd and v.gd erroring on URL…): https://wiki.archiveteam.org/?diff=51582&oldid=50413 |
20:17:47 | <h2ibot> | CreaZyp154 edited URLTeam/Dead (+149, Fixed example links for zud.me and checked and…): https://wiki.archiveteam.org/?diff=51583&oldid=51338 |
20:17:48 | <h2ibot> | CreaZyp154 edited URLTeam (+998, Added a few shorteners): https://wiki.archiveteam.org/?diff=51584&oldid=51492 |
20:17:49 | <h2ibot> | CreaZyp154 edited List of website hosts (+179, Added Free Web Hosting Area): https://wiki.archiveteam.org/?diff=51585&oldid=51565 |
20:17:50 | <h2ibot> | CreaZyp154 edited List of websites excluded from the Wayback Machine/Partial exclusions (+99, …): https://wiki.archiveteam.org/?diff=51586&oldid=51538 |
20:20:45 | <thuban> | JAA: i'm not sure whether it can, but it definitely doesn't https://github.com/webrecorder/archiveweb.page/blob/main/src/requestresponseinfo.js |
20:20:49 | <@JAA> | Essentially, feel free to add any software that interacts with WARCs to that page. I'd recommend keeping the description neutral, but otherwise, anything should be there. |
20:22:12 | <@JAA> | thuban: Yeah, that's about what I expected. The outcome from the whole crocoite/chromebot debacle was that it's impossible to do inside the browser. That's likely also why brozzler uses warcprox. |
20:22:59 | <@JAA> | I doubt anything changed in the browser APIs in the past couple years to allow raw HTTP data access. |
20:23:10 | <@JAA> | There's also the part where WARC can only store HTTP/1.1, not HTTP/2. |
20:23:36 | <thuban> | life sure is complicated |
20:23:54 | <@JAA> | And some webrecorder software writes fake HTTP/1.1 records to get around that IIRC. |
20:24:17 | <@JAA> | (Which, naturally, violates the spec.) |
20:24:38 | <Shrinks99> | Wouldn't be surprised, I know in the past Ilya has tried to push for updates to the WARC spec and run into a brick wall to the point where we just made our own file format that builds upon WARCs |
20:25:15 | <fireonlive> | life bad :( |
20:25:22 | <@JAA> | Sounds about right, and he's also not very concerned about data integrity from the discussions I've had with him on GitHub issues. |
20:26:09 | <Shrinks99> | I wouldn't say "not concerned"... There's a reason we have a whole spec for cryptographic signing of archives ;) |
20:26:10 | <@JAA> | I'm not sure warcio was *ever* compliant with the spec. |
20:26:26 | <@JAA> | But it certainly hasn't been for years now. |
20:27:30 | <@JAA> | https://github.com/webrecorder/warcio/issues/128 and https://github.com/webrecorder/warcio/issues/129 immediately break that. The former has been in the code since at least 2018. |
20:28:00 | | inedia quits [Quit: WeeChat 4.1.2] |
20:28:20 | <@JAA> | And Ilya's replies in 128 make it quite clear what his stance is, I'd say. |
20:28:43 | | DogsRNice joins |
20:29:36 | <@JAA> | To be clear, I'm referring to the contents of the WARC, not the integrity of the WARC after capture. |
20:29:48 | <@JAA> | The latter is irrelevant if the former is broken, in my opinion. |
20:29:53 | <thuban> | he did agree with you in the end, didn't he? it's just that it's not, you know, actually been changed |
20:30:12 | <@JAA> | He did, but it required a lot more convincing than it should have, and yeah, still unfixed. |
20:30:37 | <Shrinks99> | I can maybe offer some insight as to why it's still unfixed which is that warcio hasn't been the priority for us for a while :P |
20:30:41 | <@JAA> | (I believe I also discussed this elsewhere with him at the time.) |
20:31:16 | | inedia (inedia) joins |
20:31:24 | <@JAA> | Right, it's been more about accessibility? |
20:31:36 | <@JAA> | Or user friendliness, or whatever you want to call it. |
20:31:44 | <Shrinks99> | Wasn't around in 2021 so I don't have first-hand insights, but we'd probably be open to a PR? ...Though there's plenty in the repo that are newer and also unanswered :\ |
20:32:21 | <Shrinks99> | RN the priority is "high fidelity capture" which — putting aside spec compliance — browsers appear to be much better at |
20:33:10 | <Shrinks99> | And of course on my side, yeah, making web crawling & archiving tools more accessible with better UX |
20:34:07 | <@JAA> | Yeah, that's exactly the thing though. 'Putting aside spec compliance' is not something that should ever cross the mind when working on archival. Spec compliance and integrity are paramount to preservation, otherwise you can't be sure whether what is archived is correct. |
20:34:37 | <thuban> | yes, interesting notion of "fidelity", although of course i see what you mean :P |
20:34:56 | <Shrinks99> | Well... I might argue that you can't ensure that either way because not everything sent to browsers over HTTP is content-addressed / signed |
20:35:00 | <@JAA> | It's great to work on those things. But it has to happen within the restrictions of correctly preserving the data, too. |
20:35:14 | <Shrinks99> | But I hear your concerns |
20:35:19 | <@JAA> | True, but no such issue with HTTPS, mostly. |
20:35:37 | <@JAA> | Unfortunately, you can never prove that the data wasn't modified by the crawler, of course. |
20:35:45 | <Shrinks99> | Yeah, that's the big problem |
20:36:00 | <Shrinks99> | and until that's in the spec, IMO, the best you can do is give a good paper trail of who created the archive |
20:36:16 | <@JAA> | It can't be in the spec, because it's impossible with current technology. |
20:36:23 | <@JAA> | TLS would have to be redesigned for it. |
20:36:27 | <Shrinks99> | yes, the TLS spec |
20:36:36 | <@JAA> | And that's never going to happen because it's not a design goal of TLS. |
20:36:50 | <Shrinks99> | Yep! |
20:37:05 | <thuban> | ...and make sure that the people who created the archive have a reputation for being real sticklers about correctness! |
20:37:22 | <@JAA> | And even if it were, you still can't guarantee the data wasn't created later, after the key was leaked or similar. |
20:38:00 | <Shrinks99> | heh, well, that's the thing with providing the record of who created it — the viewer gets to judge if the archive is any good or not based on if they trust the software & user who created it |
20:38:17 | <Shrinks99> | So if Webrecorder's aren't good enough for you, you can make that informed decision! |
20:38:29 | <@JAA> | Indeed |
20:39:11 | <@JAA> | Almost all WARC software writes a warcinfo record with the relevant details. :-) |
20:39:57 | <Shrinks99> | *almost* https://github.com/webrecorder/browsertrix-crawler/issues/452 |
20:40:00 | <Shrinks99> | :P |
20:40:22 | <@JAA> | lol |
20:40:23 | <thuban> | :( |
20:40:56 | | @JAA is not surprised. |
20:44:46 | <Shrinks99> | IDK if that issue is actually correct, I'm pretty sure Browsertrix supports writing warcinfo records |
20:45:11 | <Shrinks99> | ah but maybe not if there's multiple warcs |
20:48:37 | | jasons (jasons) joins |
20:48:50 | <Shrinks99> | Okay well, in my wiki edit for ArchiveWebpage, I'm noting (in yellow) that "Because ArchiveWeb.page intercepts the browser's network requests, it may not write fully spec-compliant WARC files." |
20:49:39 | <Shrinks99> | Because I don't want to leave this here and not elaborate in case it doesn't get further updates, but also edit later if that's not accurate enough for ya? |
20:50:00 | <@JAA> | I'll give it a close look later and edit as necessary. :-) |
20:50:09 | <Shrinks99> | Great, TY :) |
20:50:33 | <Shrinks99> | Ah it gets sent to review |
20:50:43 | <Shrinks99> | well I suppose everything will get sorted out then! |
20:50:46 | <@JAA> | Yeah, because spam. |
20:51:04 | <@JAA> | Not a content review (except in extreme cases). |
20:51:52 | <h2ibot> | Shrinks99 edited The WARC Ecosystem (+466, Added up to date data about ReplayWebpage,…): https://wiki.archiveteam.org/?diff=51588&oldid=51519 |
20:52:07 | | line quits [Remote host closed the connection] |
20:52:09 | <nicolas17> | who wants to pay me to make a pcap to warc conversion tool? :p |
20:52:26 | | line joins |
20:54:41 | <@JAA> | > who wants ... me to make a pcap to warc conversion tool? |
20:54:43 | <@JAA> | Yes please! |
20:54:44 | <@JAA> | :-P |
20:55:20 | <fireonlive> | :P |
20:55:51 | | line quits [Remote host closed the connection] |
20:55:53 | <h2ibot> | Shrinks99 edited The WARC Ecosystem (+3, Fixes formatting, also updates PYWB link): https://wiki.archiveteam.org/?diff=51589&oldid=51588 |
20:56:10 | <Shrinks99> | (I fucked up the table formatting oof) |
20:56:44 | <TheTechRobo> | Ugh, my internet being spotty caused me to fuck it up again. lol |
20:56:53 | <h2ibot> | TheTechRobo edited The WARC Ecosystem (+1, fix ArchiveWeb.page being on the same line as…): https://wiki.archiveteam.org/?diff=51590&oldid=51589 |
20:56:59 | <TheTechRobo> | lol |
20:57:33 | <Shrinks99> | Forgot the pipe characters |
20:58:23 | | line joins |
20:58:24 | | line quits [Remote host closed the connection] |
20:58:54 | <h2ibot> | TheTechRobo edited The WARC Ecosystem (+1, Really fix formatting): https://wiki.archiveteam.org/?diff=51591&oldid=51590 |
20:59:16 | <nicolas17> | Preview button |
20:59:54 | <h2ibot> | JustAnotherArchivist changed the user rights of User:Shrinks99 |
20:59:55 | <h2ibot> | JustAnotherArchivist changed the user rights of User:CreaZyp154 |
21:00:26 | <TheTechRobo> | Yeah, I did preview my edit first, but I think S.hrinks99 did an edit at the same time as me causing weirdness, then it took awhile for the edit to submit |
21:00:49 | <Shrinks99> | You're not going to like this but I also submitted a change that added the pipe :P |
21:00:54 | <h2ibot> | JustAnotherArchivist changed the user rights of User:Inti83 |
21:00:55 | <Shrinks99> | Either way, should be sorted now |
21:00:55 | | line joins |
21:01:31 | <TheTechRobo> | Hahaha I should have left it up to you :P |
21:01:37 | | line quits [Remote host closed the connection] |
21:01:49 | <TheTechRobo> | s/up to/for/ |
21:02:33 | | line joins |
21:03:51 | | line quits [Remote host closed the connection] |
21:05:22 | | line joins |
21:06:30 | | line quits [Remote host closed the connection] |
21:06:55 | | line joins |
21:08:35 | <Shrinks99> | Alrighty well, signing off for now, @JAA I'll pass along your thoughts and tag these issues but can't promise fixes any time soon — I think we're open to PRs tho! |
21:09:40 | <fireonlive> | Shrinks99: thanks for stopping by! hope to see you back again |
21:09:45 | | line quits [Remote host closed the connection] |
21:09:50 | <Shrinks99> | <3 |
21:09:54 | | Shrinks99 quits [Client Quit] |
21:10:12 | <fireonlive> | i was about to say we’re in the same open to PR pickle but too late haha |
21:10:41 | | line joins |
21:12:57 | <h2ibot> | TheTechRobo edited Strawpoll.me (+77, Add comments URL): https://wiki.archiveteam.org/?diff=51592&oldid=51408 |
21:13:04 | <TheTechRobo> | Two questions: |
21:14:25 | <TheTechRobo> | 1. Should the wiki's Twitch page be updated to #burnthetwitch, or is that channel just for my bot? |
21:14:32 | <TheTechRobo> | (I forgot my second question, lol) |
21:16:10 | | line quits [Remote host closed the connection] |
21:17:01 | | line joins |
21:19:58 | <h2ibot> | TheTechRobo edited Twitch.tv (+27, Add link to Archives section in infobox): https://wiki.archiveteam.org/?diff=51593&oldid=51474 |
21:22:26 | | line quits [Remote host closed the connection] |
21:22:46 | | line joins |
21:24:38 | | line quits [Remote host closed the connection] |
21:25:48 | | line joins |
21:28:31 | | line quits [Remote host closed the connection] |
21:29:01 | | line joins |
21:34:36 | | line quits [Remote host closed the connection] |
21:34:56 | | line joins |
21:35:08 | <pokechu22> | line: please fix your connection when you get a chance |
21:42:29 | | BlueMaxima joins |
21:45:20 | | jasons quits [Ping timeout: 240 seconds] |
21:46:43 | | f_ (funderscore) joins |
21:59:42 | | line quits [Remote host closed the connection] |
22:00:03 | | line joins |
22:05:29 | | line quits [Remote host closed the connection] |
22:05:50 | | line joins |
22:11:38 | <line> | pokechu22: yes, reconfigured router, sorry. fixed now |
22:12:10 | <fireonlive> | :) |
22:13:12 | | Arcorann (Arcorann) joins |
22:15:21 | | tertu2 quits [Ping timeout: 272 seconds] |
22:17:11 | | tertu (tertu) joins |
22:17:56 | | evan quits [Remote host closed the connection] |
22:17:56 | | c3manu quits [Read error: Connection reset by peer] |
22:17:56 | | shreyasminocha quits [Remote host closed the connection] |
22:17:56 | | thehedgeh0g quits [Remote host closed the connection] |
22:17:59 | | evan joins |
22:18:02 | | thehedgeh0g (mrHedgehog0) joins |
22:18:02 | | shreyasminocha (shreyasminocha) joins |
22:18:02 | | c3manu (c3manu) joins |
22:22:46 | <fireonlive> | all: the situation with nitter is looking dire and all nitter instances will probably stop working in ~2-3 weeks for an unknown period of time. given this, if you have any accounts you'd like archived burning a hole in your pocket (or mean to look for some) please add them to https://pad.notkiska.pw/p/archivebot-twitter [give the notes a quick read |
22:22:46 | <fireonlive> | if you haven't, check if they're not on the list already please] so they can be run while nitter is still functional... :c |
22:22:46 | <fireonlive> | if you don't have access to AB yourself please leave a reason after the username in parens (e.g. (died, bankrupt, notable, whatever)) for archival and don't forget to set your name in the pad |
22:22:51 | <fireonlive> | further reading about nitter itself in the recent comments of https://github.com/zedeus/nitter/issues/983 if curious |
22:23:54 | <fireonlive> | also at some point earlier eggdrop will probably stop nitterizing every twitter link for similar reasons :p |
22:26:59 | | bladem (bladem) joins |
22:30:30 | <@HCross> | Can we not put random Nitter instances through AB please |
22:30:35 | <@HCross> | that'll just be nasty |
22:31:14 | <fireonlive> | HCross: indeed, we're using a special one for AT hosted by Barto |
22:31:56 | <@HCross> | I still don't think that's an amazing idea unless you're using different tokens |
22:32:09 | <@JAA> | Each instance has its own tokens, yes. |
22:32:41 | <@HCross> | Oh good |
22:33:08 | <fireonlive> | =] |
22:38:57 | <Barto> | HCross: you're somewhat not wrong, and we tried to stay on a single instance as much as we could. |
22:39:24 | <Barto> | i would have loved to not have done this, but rich electric imposteur decided the opposite |
22:42:30 | <fireonlive> | (thanks Barto by the way!) |
22:46:51 | <Barto> | fireonlive: i really feel like the smallest gear in this whole machinery |
22:48:08 | <fireonlive> | infrastructure providers/upkeepers need love too :D |
22:49:14 | | jasons (jasons) joins |
22:50:28 | <Barto> | this setup is literally running under a chair, on the floor. Can't get more ghetto than that (or shall we call it cyberpunk?). |
22:50:58 | <Barto> | it's a lovely small Odroid H2+, with 16GB of ram |
22:58:49 | <fireonlive> | i couldn't think of anything more archiveteam |
22:58:52 | <fireonlive> | :3 |
22:59:19 | <DigitalDragons> | i was gonna say, isn't jank the norm here? |
23:00:24 | <Barto> | :-) |
23:02:39 | <fireonlive> | :) |
23:04:04 | <@JAA> | It's a requirement. |
23:26:36 | <fireonlive> | Zed (creator of nitter): Guest accounts have been removed, they weren't just left to believe that. With real accounts getting rate limited immediately and likely banned, I don't see any path forward for Nitter. ~ https://github.com/zedeus/nitter/issues/983#issuecomment-1913362376 |
23:32:12 | <DogsRNice> | just scrape using an account that spams fake bitcoin stuff so it wont be banned |
23:33:17 | <@JAA> | Not dogecoin? |
23:38:19 | <fireonlive> | soooo many porn bots |
23:38:23 | <fireonlive> | and they're all straight >:| |
23:46:20 | | jasons quits [Ping timeout: 240 seconds] |