00:04:12 | | beardicus quits [Ping timeout: 250 seconds] |
00:04:58 | | beardicus (beardicus) joins |
00:12:23 | | beardicus quits [Ping timeout: 260 seconds] |
00:17:21 | <nicolas17> | all archived data is in the .warc.zst, the other files in the item don't really matter |
00:17:44 | <nicolas17> | unless you're searching by exact URL, in which case the .cdx would be faster... but then you can just use web.archive.org |
00:20:28 | | etnguyen03 quits [Client Quit] |
00:41:06 | <Webuser254947> | Thank you! I'm going to try and figure out how to open the warc.zst file |
00:55:40 | | sec^nd quits [Ping timeout: 276 seconds] |
00:56:43 | | sec^nd (second) joins |
00:57:50 | <@OrIdow6> | Webuser254947: So technical info (IDK if you'll be able to understand this or not) the zst files are compressed with a custom dictionary so you need to pass in that dict when decompressing |
00:58:11 | <@OrIdow6> | The dict (itself compressed with vanilla zstd) sits in a skippable frame at the beginning of the file |
00:58:55 | <@OrIdow6> | This script https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat can help |
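As a rough illustration of the layout described above, here is a minimal Python sketch that pulls the payload out of a skippable frame at the start of a byte string. It relies only on the general zstd frame format (a 4-byte little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, followed by a 4-byte little-endian payload size). The function name `read_leading_skippable_frame` is made up for this example and is not part of zstdwarccat or any library.

```python
import struct

# Skippable-frame magic numbers occupy 0x184D2A50..0x184D2A5F in the zstd frame format.
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def read_leading_skippable_frame(data: bytes) -> bytes:
    """Return the payload of a skippable frame at the start of `data`.

    Hypothetical helper for illustration only. Raises ValueError if `data`
    does not start with a skippable frame or the frame is truncated.
    """
    if len(data) < 8:
        raise ValueError("too short to hold a skippable frame header")
    # Header: 4-byte magic, then 4-byte payload size, both little-endian.
    magic, size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError(f"not a skippable frame (magic 0x{magic:08X})")
    payload = data[8:8 + size]
    if len(payload) != size:
        raise ValueError("truncated skippable frame")
    return payload
```

Per the messages above, the payload would be the dictionary, itself zstd-compressed. Decompressing it and then passing it as the dictionary for the rest of the file needs a zstd library or the `zstd` command-line tool, which this sketch leaves out.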
00:58:57 | | etnguyen03 (etnguyen03) joins |
01:03:13 | <Webuser254947> | Seems like I severely overestimated my abilities. I read that https://replayweb.page/ can be used to view it, but it seems like it needs to be formatted somehow? |
01:06:17 | <Webuser254947> | is that by doing the script that you sent? |
01:14:08 | <@OrIdow6> | Webuser254947: Uh, I guess what you'd do would be zstdwarccat yahooanswers-whatever.warc.zst out.warc and then put out.warc into replayweb.page |
01:14:18 | <@OrIdow6> | However this will take up like, hundreds of GBs of disk space at least |
01:15:16 | <@OrIdow6> | `zstdwarccat yahooanswers-whatever.warc.zst out.warc` |
01:17:20 | <@JAA> | `zstdwarccat yahooanswers-whatever.warc.zst >out.warc` * |
01:17:29 | | beardicus (beardicus) joins |
01:18:10 | <@OrIdow6> | JAA: Ah I can't read |
01:27:53 | <Webuser254947> | If it takes up hundreds of GBs opening it up on replayweb, idk if it's wise to try doing this on my crappy little HP laptop |
01:28:34 | <@OrIdow6> | Webuser254947: xD you might have some problems |
01:28:50 | <@OrIdow6> | The main issue is that it's just so much data you need some crazy stuff to handle it all |
01:29:02 | <@OrIdow6> | Built-in Replayweb support or not |
01:29:18 | | @OrIdow6 needs to get better at TS and go implement it for them |
01:30:51 | <Webuser254947> | Yeah, 35TB is definitely a bit of data. I just wish the archive had a user friendly search option |
01:31:19 | <@OrIdow6> | Generally that's considered a good thing due to the way it prevents stalking people in retrospect |
01:31:30 | <@OrIdow6> | But yeah on sites like this where you can be "anonymous" it's a pain |
01:31:50 | <@OrIdow6> | When you're looking for your own stuff |
01:33:01 | <Webuser254947> | Might have a better chance downloading the cdx and spamming in URLs from the estimated dates lol |
01:33:19 | <Webuser254947> | In theory if I find one I can link back to my profile and have access to everything else |
01:34:43 | | BlueMaxima quits [Read error: Connection reset by peer] |
01:38:12 | <@OrIdow6> | Webuser254947: I just don't think there's a good way |
01:38:53 | <@OrIdow6> | Maybe they used some alternate URL form in the past that'd let you map human-readable username to ID? But there's no guarantee such a thing existed, nor that if it did it was archived |
01:53:44 | | LddPotato quits [Remote host closed the connection] |
01:55:58 | <Webuser254947> | Is there a difference between "item CDX index" and "WARC CDX index"? I was only able to open item CDX index in notepad and the URLs aren't loading properly |
01:56:05 | | LddPotato (LddPotato) joins |
01:57:31 | <@JAA> | There is a difference iff the item contains more than one WARC. |
01:57:49 | <@JAA> | The item index is a sorted concatenation of the WARC indices. |
01:58:18 | <@JAA> | Most of our items usually only contain one WARC, so they'll be identical in that case. |
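The "sorted concatenation" described above can be sketched as a merge of per-WARC indices that are each already sorted. This is an illustration only and makes several assumptions: each index is already a sorted sequence of lines, any CDX header line has been removed, and plain string comparison matches the index's sort order. The name `merge_cdx` and the sample lines are invented for the example.

```python
import heapq
from typing import Iterable, Iterator


def merge_cdx(indices: Iterable[Iterable[str]]) -> Iterator[str]:
    """Merge already-sorted sequences of CDX lines into one sorted stream.

    Hypothetical helper: assumes each input is sorted and header-free.
    """
    return heapq.merge(*indices)


# Made-up sample lines standing in for two per-WARC indices.
warc_a = ["com,example)/a 20200101000000", "com,example)/c 20200101000000"]
warc_b = ["com,example)/b 20200101000000"]
item_index = list(merge_cdx([warc_a, warc_b]))
```

With a single WARC there is only one input, so the merged result equals that WARC's index, matching the point that the two files are identical in that case.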
01:59:07 | <h2ibot> | TheTechRobo edited Frequently Asked Questions (+677, Add FAQ entry for 'which files do I need'): https://wiki.archiveteam.org/?diff=54246&oldid=54207 |
01:59:36 | <TheTechRobo> | Webuser254947: Does the file name end in gz? If so, you need to decompress it first. |
01:59:48 | <@JAA> | TheTechRobo++ |
01:59:48 | <eggdrop> | [karma] 'TheTechRobo' now has 10 karma! |
02:00:56 | | LddPotato quits [Remote host closed the connection] |
02:01:46 | <Webuser254947> | TheTechRobo Yes, it ends in gz |
02:02:32 | <TheTechRobo> | That would explain it then. You'll need to decompress it with gzip. How to do that depends on what operating system you're running |
02:02:46 | <Webuser254947> | I'm on Windows |
02:03:05 | <TheTechRobo> | I think WinZip and 7zip can extract them, if you have either of those installed |
02:05:14 | | BornOn420 quits [Ping timeout: 276 seconds] |
02:07:25 | | LddPotato (LddPotato) joins |
02:08:28 | | beardicus quits [Ping timeout: 260 seconds] |
02:08:48 | <nicolas17> | hmm seems iOS+iPadOS ipsw files add up to 59.5 TiB |
02:10:38 | | beardicus (beardicus) joins |
02:13:10 | <Webuser254947> | TheTechRobo I used WinZip to decompress but the file still looks roughly the same. When I enter a URL into the Wayback Machine it won't load |
02:13:33 | <TheTechRobo> | what do you mean by 'won't load'? |
02:17:18 | | BornOn420 (BornOn420) joins |
02:19:25 | <Webuser254947> | wait a minute, I think I got it! When I was entering the URL from the compressed file, it would refresh but not update the page. But now it seems to be working alright |
02:20:44 | <Webuser254947> | Thank you all for all of your help! I'll work on my hunt tomorrow- I'm exhausted. Hope everyone has a good night :) |
02:21:18 | | beardicus quits [Ping timeout: 260 seconds] |
02:24:16 | | Webuser254947 quits [Quit: Ooops, wrong browser tab.] |
02:28:00 | | beardicus (beardicus) joins |
02:29:04 | <nicolas17> | JAA: arkiver already said it would be ok to archive Apple firmwares despite that huge total size, but I assume I can't just throw the entire list into AB at once :P should I add them in smaller lists, use high delay, wait for free slots, all of the above? |
02:29:44 | <nicolas17> | (won't be now anyway, I suspect there's a nice amount already archived properly and I want to skip those) |
02:29:59 | <pokechu22> | nicolas17: how many URLs? |
02:30:38 | <pokechu22> | the most important thing is making sure it lands on a pipeline with lots of free disk space (so based on http://archivebot.com/pipelines that'd be daguerreo or pokepipe probably) |
02:31:29 | <nicolas17> | well I was wondering that, do jobs do incremental uploads, or does the *entire* job have to fit in the pipeline's disk? |
02:32:56 | <nicolas17> | large jobs end up producing multiple WARCs, but I don't know if those WARCs are uploaded as they are each finished, or only at the end of the whole job |
02:33:35 | <@JAA> | nicolas17: Also, is there a significant amount of dupes in these? |
02:33:51 | <pokechu22> | They're uploaded incrementally (every 5GB of compressed WARC can be uploaded separately) |
02:33:54 | <@JAA> | WARCs are uploaded as they get produced. |
02:34:05 | <@JAA> | 5 GiB* |
02:36:02 | <nicolas17> | JAA: I think every file in https://updates.cdn-apple.com/ is also reachable via http://updates-http.cdn-apple.com/ but I was hoping to ignore that and only archive one of them >_< |
03:09:11 | | Naruyoko5 quits [Read error: Connection reset by peer] |
03:14:46 | | Webuser080345 joins |
03:20:45 | | Webuser080345 quits [Client Quit] |
03:21:23 | | wickedplayer494 quits [Read error: Connection reset by peer] |
03:21:59 | | Webuser172320 joins |
03:23:41 | | wickedplayer494 joins |
03:24:21 | | Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ] |
03:25:04 | | Shjosan (Shjosan) joins |
03:37:07 | | wickedplayer494 is now authenticated as wickedplayer494 |
03:42:23 | | beardicus quits [Ping timeout: 260 seconds] |
03:47:22 | | pixel leaves [Error from remote client] |
03:47:29 | | pixel (pixel) joins |
04:13:44 | | etnguyen03 quits [Remote host closed the connection] |
04:19:27 | | BlueMaxima joins |
04:26:17 | | SootBector quits [Remote host closed the connection] |
04:26:38 | | SootBector (SootBector) joins |
04:34:09 | | Webuser987869 joins |
04:36:09 | <nicolas17> | ...but other than the equivalent subdomains I think there are no duplicated files, or maybe there's 1 or 2 out of 10k |
04:36:13 | | Webuser987869 quits [Client Quit] |
04:37:15 | <nicolas17> | there's probably some duplication with existing captures though, so I plan to query cdx API to see which ones are already archived |
04:41:36 | <@arkiver> | nicolas17: on both URLs, if you plan to make WARCs, maybe archiving them with Wget-AT with URL-agnostic deduplication turned on is something that could work? |
04:44:05 | <nicolas17> | it would work, but I was hoping to throw them on AB or something, instead of messing with WARCs and local storage and 50Mbps uploads :P |
04:54:11 | | mls (mls) joins |
05:05:13 | | wickedplayer494 quits [Ping timeout: 260 seconds] |
05:05:43 | | wickedplayer494 joins |
05:05:46 | | wickedplayer494 is now authenticated as wickedplayer494 |
05:29:52 | <DigitalDragons> | 50mbps to IA seems to be not happening for anyone right now anyways :p |
05:33:23 | <steering> | 50M upload? how fortunate :P |
05:46:32 | | wickedplayer494 quits [Ping timeout: 250 seconds] |
06:01:38 | | wickedplayer494 joins |
06:01:42 | | wickedplayer494 is now authenticated as wickedplayer494 |
06:07:46 | | dontwashyourhands (dontwashyourhands) joins |
06:08:16 | <dontwashyourhands> | Hey, lately I'm getting all kinds of captchas on archive.today in different languages I don't recognize. Anyone else experiencing this? |
06:08:37 | <dontwashyourhands> | Every time I save a new page, I get a captcha, and the languages are constantly changing |
06:09:11 | <pokechu22> | Can you post a screenshot of one of them? When I get captchas on there they're usually the "train our self-driving car by clicking on traffic lights" ones |
06:09:13 | <dontwashyourhands> | I recently switched ISPs so maybe that could explain it, but ¯\_(ツ)_/¯ |
06:09:33 | <dontwashyourhands> | pokechu22: that's exactly what I'm getting |
06:10:04 | <pokechu22> | What makes them language-specific? |
06:10:50 | <dontwashyourhands> | https://i.ibb.co/XyqK6DZ/Screenshot-2025-01-17-011004.png |
06:11:01 | <dontwashyourhands> | Most of the time, I click the box and it's fine |
06:12:06 | <dontwashyourhands> | Sometimes I have to do the "click all the images with X" captcha, except the instructions are in a script I don't even know the name of. However, so far I've had no problem solving them just from the images without understanding the text |
06:12:24 | <pokechu22> | Huh. I haven't seen that. Not sure how it decides what language to use. |
06:13:01 | <dontwashyourhands> | Every time I share a page, it's a different language |
06:13:04 | <dontwashyourhands> | *save a page |
06:14:11 | <dontwashyourhands> | I was wondering if this was a site-wide thing happening with archive.today, but I haven't seen anyone else report this behaviour recently |
06:14:18 | <dontwashyourhands> | So, it's probably somehow related to me switching ISPs |
06:14:37 | | BlueMaxima quits [Client Quit] |
06:15:09 | <pokechu22> | The increased number of captchas you're getting probably is related to switching ISPs, but the random languages thing probably isn't IMO. archive.today is a bit weird though so who knows |
06:19:10 | <dontwashyourhands> | it went from 0% captchas to 100% captchas, and then in randomly rotating languages, which is odd |
06:19:31 | <dontwashyourhands> | "and then": I actually mean at the same time, not after |
06:19:39 | <dontwashyourhands> | anyway, it's a minor inconvenience, I was mostly just curious |
06:20:45 | <szczot3k> | webmaster@archive.ph might be able to help you |
06:23:58 | <dontwashyourhands> | good tip |
06:24:09 | <dontwashyourhands> | not sure if I should bother him, though |
06:24:52 | <szczot3k> | They either will provide some feedback, or they'll ignore it |
06:25:32 | | DogsRNice quits [Read error: Connection reset by peer] |
06:26:48 | <dontwashyourhands> | thanks |
06:58:01 | | loug8318142 joins |
07:26:53 | | dontwashyourhands quits [Client Quit] |
08:35:17 | | i_have_n0_idea quits [Quit: Ping timeout (120 seconds)] |
08:35:29 | | i_have_n0_idea (i_have_n0_idea) joins |
09:15:19 | | nulldata quits [Quit: So long and thanks for all the fish!] |
09:16:20 | | nulldata (nulldata) joins |
09:18:00 | | balrog quits [Ping timeout: 250 seconds] |
09:22:59 | | Island quits [Read error: Connection reset by peer] |
10:31:55 | <@OrIdow6> | JAA: I'm only a bit faster than Slowpoke here but is there any hope for running coiledfist.org in AB as-is? I see you made a valiant effort but it looked like CF knows all the AB pipeline addresses? |
12:00:03 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:00:05 | | etnguyen03 (etnguyen03) joins |
12:02:46 | | Bleo18260072271962345 joins |
12:14:13 | <eggdrop> | [remind] OrIdow6: see if the a-cho job got http://www.a-cho.com/ac/res_2019.html and http://www.a-cho.com/ac/res_2020.html |
12:58:58 | | Stagnant_ quits [Remote host closed the connection] |
12:59:03 | | Stagnant_ (Stagnant) joins |
13:07:37 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
13:09:14 | | SkilledAlpaca418962 joins |
13:20:42 | | etnguyen03 quits [Client Quit] |
13:24:34 | | etnguyen03 (etnguyen03) joins |
13:58:10 | | th3z0l4 quits [Read error: Connection reset by peer] |
14:14:53 | <@JAA> | OrIdow6: Yeah, none of the attempts worked. I'm not sure what it's based on, but it's probably a combination of IP reputation and TLS fingerprinting, i.e. fingerprinting if the IP is classified as a datacentre or something like that. |
14:53:41 | | yourfate (yourfate) joins |
14:59:13 | <yourfate> | can I check the version of my warrior from the web UI? doesn't seem that way |
14:59:26 | <yourfate> | mine shut down unexpectedly, IDK when tbh, I don't see it in the proxmox log |
15:00:07 | <yourfate> | and before I start debugging that i'd like to make sure I have the latest and greatest |
15:20:28 | | lflare quits [Quit: Bye] |
15:20:42 | | etnguyen03 quits [Client Quit] |
15:21:03 | | lflare (lflare) joins |
15:21:29 | | etnguyen03 (etnguyen03) joins |
15:33:22 | | pixel leaves [Disconnected: Replaced by new connection] |
15:33:34 | | pixel (pixel) joins |
15:52:07 | | beardicus (beardicus) joins |
15:56:48 | | beardicus quits [Ping timeout: 260 seconds] |
16:04:51 | | beardicus (beardicus) joins |
16:11:19 | | Lord_Nightmare quits [Quit: ZNC - http://znc.in] |
16:14:59 | | Lord_Nightmare (Lord_Nightmare) joins |
16:33:16 | <kpcyrd> | just to be sure, can I upload to archive.org from ipv6 only computers? |
16:33:37 | <kpcyrd> | archive.org itself doesn't have AAAA records, but maybe the upload endpoints do? |
16:34:40 | <kpcyrd> | I can cope with ghcr.io not having ipv6, but if archive.org is also legacy-ip only I'm just going to pay 0.6€ extra for not having to bother with either |
16:35:19 | <@imer> | kpcyrd: doesn't look like it :( if you're asking about AT projects - our targets usually have (and prefer) ipv6 |
16:36:39 | <@imer> | although projects working with just v6 are also an issue I guess |
16:38:51 | <h2ibot> | Cooljeanius edited TikTok (+46, /* Vital Signs */ update on US ban status): https://wiki.archiveteam.org/?diff=54247&oldid=54070 |
16:39:51 | <h2ibot> | Cooljeanius edited TikTok (+6, /* Vital Signs */ oops, forgot to close my…): https://wiki.archiveteam.org/?diff=54248&oldid=54247 |
17:10:11 | <kpcyrd> | imer: it's not an AT project unfortunately, I need to upload to archive.org myself </3 |
17:28:32 | | BornOn420 quits [Remote host closed the connection] |
17:28:33 | | sec^nd quits [Write error: Broken pipe] |
17:29:06 | | sec^nd (second) joins |
17:29:06 | | BornOn420 (BornOn420) joins |
17:45:26 | | legoktm quits [Ping timeout: 250 seconds] |
17:50:26 | | legoktm joins |
17:54:58 | | adamus1red quits [Quit: SigTerm] |
17:55:47 | | notarobot1 quits [Quit: The Lounge - https://thelounge.chat] |
17:56:09 | | notarobot1 joins |
17:58:14 | | adamus1red (adamus1red) joins |
18:08:34 | | etnguyen03 quits [Client Quit] |
18:08:38 | | katocala quits [Ping timeout: 260 seconds] |
18:08:51 | | katocala joins |
18:19:19 | | etnguyen03 (etnguyen03) joins |
18:24:00 | | katocala quits [Ping timeout: 250 seconds] |
18:24:36 | | katocala joins |
19:04:07 | | HP_Archivist quits [Quit: Leaving] |
19:16:51 | | ducky quits [Read error: Connection reset by peer] |
19:19:47 | | ducky (ducky) joins |
19:31:54 | <@OrIdow6> | JAA: ah, tx |
19:31:56 | <@OrIdow6> | *thx |
19:47:35 | | ljcool2006 quits [Remote host closed the connection] |
19:48:36 | | ljcool2006 joins |
20:12:18 | | beardicus quits [Ping timeout: 260 seconds] |
20:13:09 | | DogsRNice joins |
20:32:42 | | benjins3 quits [Ping timeout: 250 seconds] |
20:36:35 | | beardicus (beardicus) joins |
20:44:10 | | anonchek joins |
21:05:15 | | etnguyen03 quits [Client Quit] |
21:05:58 | | anonchek quits [Ping timeout: 260 seconds] |
21:19:58 | | beardicus quits [Ping timeout: 260 seconds] |
21:23:24 | | Shyy quits [Ping timeout: 250 seconds] |
21:55:42 | | benjins3 joins |
22:13:17 | | Island joins |
22:13:36 | | midou quits [Read error: Connection reset by peer] |
22:15:15 | | lunik11 quits [Quit: :x] |
22:15:48 | | lunik11 joins |
22:16:12 | <h2ibot> | Himond000 edited Niconico (+158, /* Nico Nico Shunga */ about Shunga service…): https://wiki.archiveteam.org/?diff=54249&oldid=54245 |
22:17:12 | <h2ibot> | Himond000 edited Niconico (+51, /* Nico Nico Seiga */): https://wiki.archiveteam.org/?diff=54250&oldid=54249 |
22:53:32 | | beardicus (beardicus) joins |
22:53:49 | <szczot3k> | Do we have a boilerplate abuse response for AT? |
22:55:06 | <szczot3k> | I have abuse mailbox on my ovh nodes set to my contact, so I just got one |
22:58:29 | | etnguyen03 (etnguyen03) joins |
22:59:26 | <@OrIdow6> | Anyone able to access https://seiga.nicovideo.jp/ ? Subsection being deleted EOM per edits a few messages above |
22:59:39 | <@OrIdow6> | Gets me an error page from the US |
23:00:19 | <@OrIdow6> | ... for context this site restricted access from foreign countries, in a way the announcement left kinda vague, because foreigners didn't like that they had porn |
23:00:31 | <szczot3k> | EU/Poland: moves me to https://www.nicovideo.jp/region_restriction |
23:00:55 | <@OrIdow6> | Hmmm, that's a bad sign |
23:01:38 | <@OrIdow6> | Might have to get it from Japan |
23:01:57 | <@OrIdow6> | !remindme 4h add thing to dw |
23:01:57 | <eggdrop> | [remind] ok, i'll remind you at 2025-01-18T03:01:57Z |
23:02:43 | | Shyy joins |
23:03:09 | <@imer> | uk has the same redirect too |
23:14:10 | <FireFly> | EU/Germany, same here |
23:14:32 | <FireFly> | (which I guess tracks that it matches) |
23:17:28 | | tech234a quits [Quit: Connection closed for inactivity] |
23:21:41 | <@JAA> | CH, FI, NZ: same |
23:32:40 | | etnguyen03 quits [Client Quit] |
23:44:38 | | riteo quits [Ping timeout: 260 seconds] |
23:46:45 | <pokechu22> | US redirects too |
23:46:55 | | riteo (riteo) joins |