00:04:12beardicus quits [Ping timeout: 250 seconds]
00:04:58beardicus (beardicus) joins
00:12:23beardicus quits [Ping timeout: 260 seconds]
00:17:21<nicolas17>all archived data is in the .warc.zst, the other files in the item don't really matter
00:17:44<nicolas17>unless you're searching by exact URL, in which case the .cdx would be faster... but then you can just use web.archive.org
00:20:28etnguyen03 quits [Client Quit]
00:41:06<Webuser254947>Thank you! I'm going to try and figure out how to open the warc.zst file
00:55:40sec^nd quits [Ping timeout: 276 seconds]
00:56:43sec^nd (second) joins
00:57:50<@OrIdow6>Webuser254947: So technical info (IDK if you'll be able to understand this or not) the zst files are compressed with a custom dictionary so you need to pass in that dict when decompressing
00:58:11<@OrIdow6>The dict (itself compressed with vanilla zstd) sits in a skippable frame at the beginning of the file
00:58:55<@OrIdow6>This script https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat can help
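The layout described above (a skippable frame at the start of the file holding the dictionary) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the linked zstdwarccat script: the function name `read_leading_skippable_frame` is hypothetical, and the sketch relies only on the general Zstandard frame format, where a skippable frame is a 4-byte little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian payload size, and then the payload.

```python
import struct

# Zstandard reserves these magic numbers for skippable frames.
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def read_leading_skippable_frame(stream):
    """Return the payload of a skippable frame at the current stream position.

    Returns None if the next 8 bytes are not a skippable-frame header.
    Hypothetical helper for illustration; the stream is not rewound.
    """
    header = stream.read(8)
    if len(header) < 8:
        return None
    # Both header fields are 32-bit little-endian unsigned integers.
    magic, size = struct.unpack("<II", header)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        return None
    payload = stream.read(size)
    if len(payload) != size:
        raise ValueError("truncated skippable frame")
    return payload
```

If the file is laid out as described, the returned payload would be the dictionary, still compressed with plain zstd. Decompressing it and passing it as the dictionary for the rest of the file is left to a zstd library or to the script linked above.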
00:58:57etnguyen03 (etnguyen03) joins
01:03:13<Webuser254947>Seems like I severely overestimated my abilities. I read that https://replayweb.page/ can be used to view it, but it seems like it needs to be formatted somehow?
01:06:17<Webuser254947>is that by doing the script that you sent?
01:14:08<@OrIdow6>Webuser254947: Uh, I guess what you'd do would be zstdwarccat yahooanswers-whatever.warc.zst out.warc and then put out.warc into replayweb.page
01:14:18<@OrIdow6>However this will take up like, hundreds of GBs of disk space at least
01:15:16<@OrIdow6>`zstdwarccat yahooanswers-whatever.warc.zst out.warc`
01:17:20<@JAA>`zstdwarccat yahooanswers-whatever.warc.zst >out.warc` *
01:17:29beardicus (beardicus) joins
01:18:10<@OrIdow6>JAA: Ah I can't read
01:27:53<Webuser254947>If it takes up hundreds of GBs opening it up on replayweb, idk if it's wise to try doing this on my crappy little HP laptop
01:28:34<@OrIdow6>Webuser254947: xD you might have some problems
01:28:50<@OrIdow6>The main issue is that it's just so much data you need some crazy stuff to handle it all
01:29:02<@OrIdow6>Built-in Replayweb support or not
01:29:18@OrIdow6 needs to get better at TS and go implement it for them
01:30:51<Webuser254947>Yeah, 35TB is definitely a bit of data. I just wish the archive had a user friendly search option
01:31:19<@OrIdow6>Generally that's considered a good thing due to the way it prevents stalking people in retrospect
01:31:30<@OrIdow6>But yeah on sites like this where you can be "anonymous" it's a pain
01:31:50<@OrIdow6>When you're looking for your own stuff
01:33:01<Webuser254947>Might have a better chance downloading the cdx and spamming in URLs from the estimated dates lol
01:33:19<Webuser254947>In theory if I find one I can link back to my profile and have access to everything else
01:34:43BlueMaxima quits [Read error: Connection reset by peer]
01:38:12<@OrIdow6>Webuser254947: I just don't think there's a good way
01:38:53<@OrIdow6>Maybe they used some alternate URL form in the past that'd let you map human-readable username to ID? But there's no guarantee such a thing existed, nor that if it did it was archived
01:53:44LddPotato quits [Remote host closed the connection]
01:55:58<Webuser254947>Is there a difference between "item CDX index" and "WARC CDX index"? I was only able to open item CDX index in notepad and the URLs aren't loading properly
01:56:05LddPotato (LddPotato) joins
01:57:31<@JAA>There is a difference iff the item contains more than one WARC.
01:57:49<@JAA>The item index is a sorted concatenation of the WARC indices.
01:58:18<@JAA>Most of our items usually only contain one WARC, so they'll be identical in that case.
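The "sorted concatenation" relationship can be illustrated with a short Python sketch. It assumes each per-WARC index is already a sorted list of CDX lines with any header line removed; the function name `merge_cdx` and the sample lines in the usage note are made up for illustration.

```python
import heapq


def merge_cdx(*warc_indices):
    """Merge already-sorted per-WARC CDX line lists into one sorted list.

    Hypothetical helper: assumes header lines were stripped beforehand.
    """
    # heapq.merge walks the sorted inputs in step, so nothing is re-sorted.
    return list(heapq.merge(*warc_indices))
```

With a single input, `merge_cdx(index)` returns the same lines in the same order, which matches the point that the item index and the WARC index are identical when an item holds only one WARC.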
01:59:07<h2ibot>TheTechRobo edited Frequently Asked Questions (+677, Add FAQ entry for 'which files do I need'): https://wiki.archiveteam.org/?diff=54246&oldid=54207
01:59:36<TheTechRobo>Webuser254947: Does the file name end in gz? If so, you need to decompress it first.
01:59:48<@JAA>TheTechRobo++
01:59:48<eggdrop>[karma] 'TheTechRobo' now has 10 karma!
02:00:56LddPotato quits [Remote host closed the connection]
02:01:46<Webuser254947>TheTechRobo Yes, it ends in gz
02:02:32<TheTechRobo>That would explain it then. You'll need to decompress it with gzip. How to do that depends on what operating system you're running
02:02:46<Webuser254947>I'm on Windows
02:03:05<TheTechRobo>I think WinZip and 7zip can extract them, if you have either of those installed
02:05:14BornOn420 quits [Ping timeout: 276 seconds]
02:07:25LddPotato (LddPotato) joins
02:08:28beardicus quits [Ping timeout: 260 seconds]
02:08:48<nicolas17>hmm seems iOS+iPadOS ipsw files add up to 59.5 TiB
02:10:38beardicus (beardicus) joins
02:13:10<Webuser254947>TheTechRobo I used WinZip to decompress but the file still looks roughly the same. When I enter a URL into the wayback machine it won't load
02:13:33<TheTechRobo>what do you mean by 'won't load'?
02:17:18BornOn420 (BornOn420) joins
02:19:25<Webuser254947>wait a minute, I think I got it! When I was entering the URL from the compressed file, it would refresh but not update the page. But now it seems to be working alright
02:20:44<Webuser254947>Thank you all for all of your help! I'll work on my hunt tomorrow- I'm exhausted. Hope everyone has a good night :)
02:21:18beardicus quits [Ping timeout: 260 seconds]
02:24:16Webuser254947 quits [Quit: Ooops, wrong browser tab.]
02:28:00beardicus (beardicus) joins
02:29:04<nicolas17>JAA: arkiver already said it would be ok to archive Apple firmwares despite that huge total size, but I assume I can't just throw the entire list into AB at once :P should I add them in smaller lists, use high delay, wait for free slots, all of the above?
02:29:44<nicolas17>(won't be now anyway, I suspect there's a nice amount already archived properly and I want to skip those)
02:29:59<pokechu22>nicolas17: how many URLs?
02:30:38<pokechu22>the most important thing is making sure it lands on a pipeline with lots of free disk space (so based on http://archivebot.com/pipelines that'd be daguerreo or pokepipe probably)
02:31:29<nicolas17>well I was wondering that, do jobs do incremental uploads, or does the *entire* job have to fit in the pipeline's disk?
02:32:56<nicolas17>large jobs end up producing multiple WARCs, but I don't know if those WARCs are uploaded as they are each finished, or only at the end of the whole job
02:33:35<@JAA>nicolas17: Also, is there a significant amount of dupes in these?
02:33:51<pokechu22>They're uploaded incrementally (every 5GB of compressed WARC can be uploaded separately)
02:33:54<@JAA>WARCs are uploaded as they get produced.
02:34:05<@JAA>5 GiB*
02:36:02<nicolas17>JAA: I think every file in https://updates.cdn-apple.com/ is also reachable via http://updates-http.cdn-apple.com/ but I was hoping to ignore that and only archive one of them >_<
03:09:11Naruyoko5 quits [Read error: Connection reset by peer]
03:14:46Webuser080345 joins
03:20:45Webuser080345 quits [Client Quit]
03:21:23wickedplayer494 quits [Read error: Connection reset by peer]
03:21:59Webuser172320 joins
03:23:41wickedplayer494 joins
03:24:21Shjosan quits [Quit: Am sleepy (-, – )…zzzZZZ]
03:25:04Shjosan (Shjosan) joins
03:42:23beardicus quits [Ping timeout: 260 seconds]
03:47:22pixel leaves [Error from remote client]
03:47:29pixel (pixel) joins
04:13:44etnguyen03 quits [Remote host closed the connection]
04:19:27BlueMaxima joins
04:26:17SootBector quits [Remote host closed the connection]
04:26:38SootBector (SootBector) joins
04:34:09Webuser987869 joins
04:36:09<nicolas17>...but other than the equivalent subdomains I think there are no duplicated files, or maybe there's 1 or 2 out of 10k
04:36:13Webuser987869 quits [Client Quit]
04:37:15<nicolas17>there's probably some duplication with existing captures though, so I plan to query cdx API to see which ones are already archived
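A rough sketch of what such a CDX check could look like, building a query URL for the Wayback Machine CDX server. The function name `cdx_query_url` is hypothetical, and filtering on `statuscode:200` is only one possible reading of "already archived"; it does not verify that a capture is complete.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(target_url, limit=1):
    """Build a CDX query asking for up to `limit` HTTP 200 captures of target_url."""
    params = {
        "url": target_url,
        "output": "json",
        # Assumption: a 200 capture counts as "already archived".
        "filter": "statuscode:200",
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)
```

Fetching the resulting URL and checking whether any capture rows come back would then separate URLs that still need archiving from those that can be skipped.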
04:41:36<@arkiver>nicolas17: on both URLs, if you plan to make WARCs, maybe archiving them with Wget-AT with URL-agnostic deduplication turned on is something that could work?
04:44:05<nicolas17>it would work, but I was hoping to throw them on AB or something, instead of messing with WARCs and local storage and 50Mbps uploads :P
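The idea behind URL-agnostic deduplication can be sketched as keying on a digest of the response payload instead of on the URL, so the same file served from two hostnames is stored once. This is a conceptual illustration only, not how Wget-AT implements it; `dedupe_by_payload` and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib


def dedupe_by_payload(records):
    """Split (url, payload_bytes) pairs into first-seen URLs and duplicates.

    Returns (unique_urls, duplicates), where each duplicate is a tuple of
    (duplicate_url, url_of_first_capture_with_same_payload).
    """
    seen = {}  # payload digest -> URL of the first record with that payload
    unique, duplicates = [], []
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            duplicates.append((url, seen[digest]))
        else:
            seen[digest] = url
            unique.append(url)
    return unique, duplicates
```

Applied to the two Apple hostnames mentioned earlier, the second copy of each file would land in `duplicates` and would only need a small record pointing at the first capture.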
04:54:11mls (mls) joins
05:05:13wickedplayer494 quits [Ping timeout: 260 seconds]
05:05:43wickedplayer494 joins
05:29:52<DigitalDragons>50mbps to IA seems to be not happening for anyone right now anyways :p
05:33:23<steering>50M upload? how fortunate :P
05:46:32wickedplayer494 quits [Ping timeout: 250 seconds]
06:01:38wickedplayer494 joins
06:07:46dontwashyourhands (dontwashyourhands) joins
06:08:16<dontwashyourhands>Hey, lately I'm getting all kinds of captchas on archive.today in different languages I don't recognize. Anyone else experiencing this?
06:08:37<dontwashyourhands>Every time I save a new page, I get a captcha, and the languages are constantly changing
06:09:11<pokechu22>Can you post a screenshot of one of them? When I get captchas on there they're usually the "train our self-driving car by clicking on traffic lights" ones
06:09:13<dontwashyourhands>I recently switched ISPs so maybe that could explain it, but ¯\_(ツ)_/¯
06:09:33<dontwashyourhands>pokechu22: that's exactly what I'm getting
06:10:04<pokechu22>What makes them language-specific?
06:10:50<dontwashyourhands>https://i.ibb.co/XyqK6DZ/Screenshot-2025-01-17-011004.png
06:11:01<dontwashyourhands>Most of the time, I click the box and it's fine
06:12:06<dontwashyourhands>Sometimes I have to do the "click all the images with X" captcha, except the instructions are in a script I don't even know the name of. However, so far I've had no problem solving them just from the images without understanding the text
06:12:24<pokechu22>Huh. I haven't seen that. Not sure how it decides what language to use.
06:13:01<dontwashyourhands>Every time I share a page, it's a different language
06:13:04<dontwashyourhands>*save a page
06:14:11<dontwashyourhands>I was wondering if this was a site-wide thing happening with archive.today, but I haven't seen anyone else report this behaviour recently
06:14:18<dontwashyourhands>So, it's probably somehow related to me switching ISPs
06:14:37BlueMaxima quits [Client Quit]
06:15:09<pokechu22>The increased number of captchas you're getting probably is related to switching ISPs, but the random languages thing probably isn't IMO. archive.today is a bit weird though so who knows
06:19:10<dontwashyourhands>it went from 0% captchas to 100% captchas, and then in randomly rotating languages, which is odd
06:19:31<dontwashyourhands>"and then": I actually mean at the same time, not after
06:19:39<dontwashyourhands>anyway, it's a minor inconvenience, I was mostly just curious
06:20:45<szczot3k>webmaster@archive.ph might be able to help you
06:23:58<dontwashyourhands>good tip
06:24:09<dontwashyourhands>not sure if I should bother him, though
06:24:52<szczot3k>They either will provide some feedback, or they'll ignore it
06:25:32DogsRNice quits [Read error: Connection reset by peer]
06:26:48<dontwashyourhands>thanks
06:58:01loug8318142 joins
07:26:53dontwashyourhands quits [Client Quit]
08:35:17i_have_n0_idea quits [Quit: Ping timeout (120 seconds)]
08:35:29i_have_n0_idea (i_have_n0_idea) joins
09:15:19nulldata quits [Quit: So long and thanks for all the fish!]
09:16:20nulldata (nulldata) joins
09:18:00balrog quits [Ping timeout: 250 seconds]
09:22:59Island quits [Read error: Connection reset by peer]
10:31:55<@OrIdow6>JAA: I'm only a bit faster than Slowpoke here but is there any hope for running coiledfist.org in AB as-is? I see you made a valiant effort but it looked like CF knows all the AB pipeline addresses?
12:00:03Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat]
12:00:05etnguyen03 (etnguyen03) joins
12:02:46Bleo18260072271962345 joins
12:14:13<eggdrop>[remind] OrIdow6: see if the a-cho job got http://www.a-cho.com/ac/res_2019.html and http://www.a-cho.com/ac/res_2020.html
12:58:58Stagnant_ quits [Remote host closed the connection]
12:59:03Stagnant_ (Stagnant) joins
13:07:37SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
13:09:14SkilledAlpaca418962 joins
13:20:42etnguyen03 quits [Client Quit]
13:24:34etnguyen03 (etnguyen03) joins
13:58:10th3z0l4 quits [Read error: Connection reset by peer]
14:14:53<@JAA>OrIdow6: Yeah, none of the attempts worked. I'm not sure what it's based on, but it's probably a combination of IP reputation and TLS fingerprinting, i.e. fingerprinting if the IP is classified as a datacentre or something like that.
14:53:41yourfate (yourfate) joins
14:59:13<yourfate>can I check the version of my warrior from the web UI? doesn't seem that way
14:59:26<yourfate>mine shut down unexpectedly, IDK when tbh, I don't see it in the proxmox log
15:00:07<yourfate>and before I start debugging that i'd like to make sure I have the latest and greatest
15:20:28lflare quits [Quit: Bye]
15:20:42etnguyen03 quits [Client Quit]
15:21:03lflare (lflare) joins
15:21:29etnguyen03 (etnguyen03) joins
15:33:22pixel leaves [Disconnected: Replaced by new connection]
15:33:34pixel (pixel) joins
15:52:07beardicus (beardicus) joins
15:56:48beardicus quits [Ping timeout: 260 seconds]
16:04:51beardicus (beardicus) joins
16:11:19Lord_Nightmare quits [Quit: ZNC - http://znc.in]
16:14:59Lord_Nightmare (Lord_Nightmare) joins
16:33:16<kpcyrd>just to be sure, can I upload to archive.org from ipv6 only computers?
16:33:37<kpcyrd>archive.org itself doesn't have AAAA records, but maybe the upload endpoints do?
16:34:40<kpcyrd>I can cope with ghcr.io not having ipv6, but if archive.org is also legacy-ip only I'm just going to pay 0.6€ extra for not having to bother with either
16:35:19<@imer>kpcyrd: doesn't look like it :( if you're asking about AT projects - our targets usually have (and prefer) ipv6
16:36:39<@imer>although projects working with just v6 are also an issue I guess
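Whether a given hostname publishes IPv6 addresses can be checked from Python with the standard resolver. A minimal sketch: the function name `ipv6_addresses` and the `resolver` parameter (there only so the lookup can be swapped out in tests) are illustrative, and which upload hostnames to check is left to the reader.

```python
import socket


def ipv6_addresses(host, port=443, resolver=socket.getaddrinfo):
    """Return the sorted IPv6 addresses a hostname resolves to, or [] if none."""
    try:
        # Restricting the family to AF_INET6 asks only for IPv6 results.
        infos = resolver(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

An empty list for a hostname means the resolver returned no IPv6 address for it, so an IPv6-only machine would need some form of translation or proxy to reach it.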
16:38:51<h2ibot>Cooljeanius edited TikTok (+46, /* Vital Signs */ update on US ban status): https://wiki.archiveteam.org/?diff=54247&oldid=54070
16:39:51<h2ibot>Cooljeanius edited TikTok (+6, /* Vital Signs */ oops, forgot to close my…): https://wiki.archiveteam.org/?diff=54248&oldid=54247
17:10:11<kpcyrd>imer: it's not an AT project unfortunately, I need to upload to archive.org myself </3
17:28:32BornOn420 quits [Remote host closed the connection]
17:28:33sec^nd quits [Write error: Broken pipe]
17:29:06sec^nd (second) joins
17:29:06BornOn420 (BornOn420) joins
17:45:26legoktm quits [Ping timeout: 250 seconds]
17:50:26legoktm joins
17:54:58adamus1red quits [Quit: SigTerm]
17:55:47notarobot1 quits [Quit: The Lounge - https://thelounge.chat]
17:56:09notarobot1 joins
17:58:14adamus1red (adamus1red) joins
18:08:34etnguyen03 quits [Client Quit]
18:08:38katocala quits [Ping timeout: 260 seconds]
18:08:51katocala joins
18:19:19etnguyen03 (etnguyen03) joins
18:24:00katocala quits [Ping timeout: 250 seconds]
18:24:36katocala joins
19:04:07HP_Archivist quits [Quit: Leaving]
19:16:51ducky quits [Read error: Connection reset by peer]
19:19:47ducky (ducky) joins
19:31:54<@OrIdow6>JAA: ah, tx
19:31:56<@OrIdow6>*thx
19:47:35ljcool2006 quits [Remote host closed the connection]
19:48:36ljcool2006 joins
20:12:18beardicus quits [Ping timeout: 260 seconds]
20:13:09DogsRNice joins
20:32:42benjins3 quits [Ping timeout: 250 seconds]
20:36:35beardicus (beardicus) joins
20:44:10anonchek joins
21:05:15etnguyen03 quits [Client Quit]
21:05:58anonchek quits [Ping timeout: 260 seconds]
21:19:58beardicus quits [Ping timeout: 260 seconds]
21:23:24Shyy quits [Ping timeout: 250 seconds]
21:55:42benjins3 joins
22:13:17Island joins
22:13:36midou quits [Read error: Connection reset by peer]
22:15:15lunik11 quits [Quit: :x]
22:15:48lunik11 joins
22:16:12<h2ibot>Himond000 edited Niconico (+158, /* Nico Nico Shunga */ about Shunga service…): https://wiki.archiveteam.org/?diff=54249&oldid=54245
22:17:12<h2ibot>Himond000 edited Niconico (+51, /* Nico Nico Seiga */): https://wiki.archiveteam.org/?diff=54250&oldid=54249
22:53:32beardicus (beardicus) joins
22:53:49<szczot3k>Do we have a boilerplate abuse response for AT?
22:55:06<szczot3k>I have abuse mailbox on my ovh nodes set to my contact, so I just got one
22:58:29etnguyen03 (etnguyen03) joins
22:59:26<@OrIdow6>Anyone able to access https://seiga.nicovideo.jp/ ? Subsection being deleted EOM per edits a few messages above
22:59:39<@OrIdow6>Gets me an error page from the US
23:00:19<@OrIdow6>... for context this site restricted access from foreign countries, in a way the announcement left kinda vague, because foreigners didn't like that they had porn
23:00:31<szczot3k>EU/Poland: moves me to https://www.nicovideo.jp/region_restriction
23:00:55<@OrIdow6>Hmmm, that's a bad sign
23:01:38<@OrIdow6>Might have to get it from Japan
23:01:57<@OrIdow6>!remindme 4h add thing to dw
23:01:57<eggdrop>[remind] ok, i'll remind you at 2025-01-18T03:01:57Z
23:02:43Shyy joins
23:03:09<@imer>uk has the same redirect too
23:14:10<FireFly>EU/Germany, same here
23:14:32<FireFly>(which I guess tracks that it matches)
23:17:28tech234a quits [Quit: Connection closed for inactivity]
23:21:41<@JAA>CH, FI, NZ: same
23:32:40etnguyen03 quits [Client Quit]
23:44:38riteo quits [Ping timeout: 260 seconds]
23:46:45<pokechu22>US redirects too
23:46:55riteo (riteo) joins