00:00:12 | <lemuria> | but since i already have uploads on the internet archive for some other completely unrelated website where i made a newspaper with references to that streamer (as in, searching their name in text contents would get you results), and that there's no evidence she hasn't seen them yet... maybe she wouldn't notice?
00:00:26 | <nicolas17> | why is she removing videos?
00:00:26 | <lemuria> | the analysis paralysis is real, i just wish i had the necessary balls
00:00:39 | <lemuria> | channel reorganization or whatever, she didn't provide an explicit reason
00:02:39 | <lemuria> | "i've been cleaning up my og youtube" - partial quote from her
00:04:54 | | CuppyMan joins |
00:05:17 | <@OrIdow6> | ... personally I think it's more justifiable if it really is just that |
00:06:15 | | CuppyMan quits [Client Quit] |
00:06:21 | | Cuphead2527480 (Cuphead2527480) joins |
00:23:58 | <Guest> | might be how people remove their old videos for being "cringe" |
00:26:59 | <lemuria> | We must enforce "once it's on the internet, it's there forever" after all |
00:27:51 | <wickedplayer494> | ^^^ |
00:28:00 | <lemuria> | welp, time to press enter on the command. i hope i don't wake up to a ban from the streamer's community |
00:28:12 | <lemuria> | information wants to be free! |
00:29:31 | <lemuria> | and then it ends up in Community Texts and not Community Movies.. how does IA's collection system work anyway |
00:31:30 | <lemuria> | once my 35 Mbit/s internet upload speed catches up it should be at https://archive.org/details/nalani-proctor-baking-melodies |
00:31:38 | <@JAA> | You have to specify the collection at item creation time. |
00:32:15 | <@JAA> | opensource (= 'Community Texts') is the default. |
00:32:26 | <lemuria> | --metadata="collection:opensource_movies"?? |
00:32:34 | <@JAA> | If it's still on the first file upload, you could abort it and restart with ... that, yes. |
00:32:49 | <lemuria> | there's no going back now, description and info.json uploaded |
00:32:55 | <@JAA> | Ah, oh well |
00:39:34 | | etnguyen03 (etnguyen03) joins |
00:44:27 | <lemuria> | good night, here's to the upload not crashing and burning while i eep |
01:15:09 | <nicolas17> | lemuria: the mediatype and collection can only be changed by admins, you can email info@archive.org to get that fixed |
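The collection/mediatype discussion above can be sketched concretely: on archive.org's S3-like upload API, both values travel as `x-archive-meta-*` headers on the PUT that creates the item, which is why they are fixed at creation time and only admins can change them later. The header names are real; everything else here is an illustration, not a full upload client.

```python
# Sketch: collection and mediatype are set via x-archive-meta-* headers
# on the item-creating upload request (archive.org S3-like API).

def ia_creation_headers(collection: str, mediatype: str) -> dict:
    """Headers that fix an item's collection and mediatype at creation time."""
    return {
        "x-archive-meta-collection": collection,  # e.g. 'opensource' is the default ('Community Texts')
        "x-archive-meta-mediatype": mediatype,
    }

# What lemuria would have needed for 'Community Video' instead of the default:
headers = ia_creation_headers("opensource_movies", "movies")
# These headers only take effect on the request that creates the item;
# after that, changing collection/mediatype requires emailing info@archive.org.
```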
01:24:39 | <Ryz> | hexagonwin|m, Tistory websites generally have calendars that go on forever, which is why there's an ignore that caps them so they don't reach further than 2050 or, I think, further back than 2000
01:39:44 | | cyanbox joins |
01:57:05 | <hexagonwin|m> | @Ryz:hackint.org ah you mean like the calendar here on https://coconx.tistory.com/ ? But tistory has sitemap.xml which links to all post, so there should be no need to recursively download everything i believe.. |
02:15:37 | | aninternettroll quits [Ping timeout: 258 seconds] |
02:26:22 | | aninternettroll (aninternettroll) joins |
02:32:37 | | SootBector quits [Remote host closed the connection] |
02:33:48 | | SootBector (SootBector) joins |
02:33:49 | | Cuphead2527480 quits [Client Quit] |
02:46:31 | | etnguyen03 quits [Remote host closed the connection] |
02:57:22 | <Ryz> | hexagonwin|m, as long as it's set up so that calendars don't get crawled, because otherwise AB would've gotten up to https://coconx.tistory.com/archive/999912
03:01:09 | <hexagonwin|m> | I think it should be sufficient just downloading everything in https://coconx.tistory.com/sitemap.xml (and their page requisites), but the attachment files are also links so not sure about that. Maybe only download the links inside the article div?
03:02:15 | <hexagonwin|m> | btw, do we also get all the pages for article list? like https://coconx.tistory.com/category/%EC%82%AC%EB%8A%94%20%EC%9D%B4%EC%95%BC%EA%B8%B0 to https://coconx.tistory.com/category/%EC%82%AC%EB%8A%94%20%EC%9D%B4%EC%95%BC%EA%B8%B0?page=31 |
03:02:33 | <hexagonwin|m> | would be great to have but actual blog posts should be prioritized i guess.. |
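hexagonwin|m's sitemap-first approach can be sketched with stdlib-only parsing. The sample XML below is made up for illustration, and note that real Tistory sitemaps aren't guaranteed to list every post (arkiver later points out a post missing from quizbang.tistory.com's sitemap), so this is a starting seed list, not a complete crawl.

```python
# Sketch: pull post URLs out of a Tistory sitemap.xml (sitemaps.org format).
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> URL in a sitemap urlset."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

# Hypothetical sample; a real sitemap would come from
# https://coconx.tistory.com/sitemap.xml
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://coconx.tistory.com/1</loc></url>
  <url><loc>https://coconx.tistory.com/2</loc></url>
</urlset>"""
```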
03:04:13 | | Island quits [Read error: Connection reset by peer] |
03:22:02 | <hexagonwin|m> | (unrelated to tistory) if it's possible, could someone please share the log for androidfilehost.com on archivebot? (like wpull.log on grab-site) I need the list of URLs it downloaded, so that it can be compared to my first attempt and also extract the total list of FIDs. Thanks. |
03:23:33 | <@JAA> | hexagonwin|m: The log is in the -meta.warc.gz on IA sometime after the job finishes. |
03:28:18 | <nicolas17> | or the cdx's |
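Getting the list of downloaded URLs from the CDX, as nicolas17 suggests, is a one-pass text parse. This assumes the common space-separated CDX layout (` CDX N b a m s k r M S V g`) where the third field is the original URL; the sample lines are fabricated.

```python
# Sketch: extract the archived URL list from a CDX index file.
# Assumes the default field order where field 3 is the original URL.

def urls_from_cdx(lines) -> list:
    urls = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):  # skip the header line
            continue
        fields = line.split(" ")
        if len(fields) >= 3:
            urls.append(fields[2])
    return urls

# Fabricated sample resembling ArchiveBot CDX output:
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20250101000000 https://example.com/ text/html 200 ABC - - 123 456 file.warc.gz",
]
```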
03:32:40 | | wyatt8750 joins |
03:32:54 | | wyatt8740 quits [Ping timeout: 260 seconds] |
03:53:18 | | ScarlettStunningSpace quits [Read error: Connection reset by peer] |
04:01:49 | | Karlett (Karlett) joins |
04:10:37 | | APOLLO03 quits [Ping timeout: 258 seconds] |
04:12:29 | | APOLLO03 joins |
04:22:16 | | cyanbox_ joins |
04:25:16 | | Webuser654665 joins |
04:25:20 | | Webuser654665 quits [Client Quit] |
04:25:34 | | cyanbox quits [Ping timeout: 258 seconds] |
04:52:21 | <h2ibot> | Hans5958 edited Warrior projects (+9, Add #Y): https://wiki.archiveteam.org/?diff=57409&oldid=57403 |
04:53:21 | <h2ibot> | Hans5958 edited Warrior projects (-30, Put #Y on hiatus): https://wiki.archiveteam.org/?diff=57410&oldid=57409 |
04:56:22 | <h2ibot> | Hans5958 edited Warrior projects (+20, Put some 2025 projects started on 2025): https://wiki.archiveteam.org/?diff=57411&oldid=57410 |
05:51:59 | | BornOn420 quits [Quit: Textual IRC Client: www.textualapp.com] |
05:52:14 | | BornOn420 (BornOn420) joins |
06:17:59 | | BornOn420 quits [Ping timeout: 260 seconds] |
07:07:52 | <pabs> | hexagonwin|m: I do searching when archiving domains, for subdomains as well as related resources like twitter/github/mediawiki/etc |
07:16:42 | | Radzig2 joins |
07:18:04 | | Radzig quits [Ping timeout: 258 seconds] |
07:18:04 | | Radzig2 is now known as Radzig |
07:23:01 | | b3nzo joins |
07:27:20 | | Webuser754283 joins |
07:30:20 | | Webuser754283 quits [Client Quit] |
07:32:39 | | Commander001 quits [Ping timeout: 260 seconds] |
07:33:00 | | Commander001 joins |
07:44:31 | | Commander001 quits [Ping timeout: 258 seconds] |
07:45:03 | | Commander001 joins |
08:06:52 | | beastbg8_ joins |
08:09:59 | | beastbg8 quits [Ping timeout: 260 seconds] |
08:59:28 | | HP_Archivist (HP_Archivist) joins |
09:16:29 | | Suika quits [Ping timeout: 260 seconds] |
09:18:56 | | Suika joins |
09:32:13 | <@arkiver> | hexagonwin|m: please also make sure next time to put deadlines on https://wiki.archiveteam.org/index.php/Deathwatch |
09:35:18 | | Commander001 quits [Ping timeout: 258 seconds] |
09:36:10 | | Commander001 joins |
10:11:23 | | b3nz0 joins |
10:13:53 | | b3nzo quits [Client Quit] |
10:14:27 | | b3nz0 quits [Remote host closed the connection] |
10:18:47 | | b3nzo joins |
10:19:43 | | b3nz0 joins |
10:38:08 | <b3nzo> | sorry, sent it in #archiveteam. > JAA: whats the best way to pack many warc, meta-warc and cdx files? megawarc |
10:43:37 | <@JAA> | b3nzo: How many is 'many', and why do you want to pack them? |
10:45:56 | <b3nzo> | around 900GB, i want to upload them to the IA |
10:46:15 | <@JAA> | How many files? |
10:58:24 | <b3nzo> | around 90k |
11:00:03 | | Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:48 | | Bleo182600722719623455222 joins |
11:08:04 | <@arkiver> | b3nzo: i'd say pack them into 100 GB chunks and upload those |
11:08:08 | <@arkiver> | keep the meta WARCs separate |
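The batching step arkiver describes (90k small WARCs into ~100 GB chunks, each chunk then packed with megawarc) is simple greedy planning logic. This is only the grouping, not the packing itself; file names and sizes are placeholders.

```python
# Sketch: greedily group (path, size) pairs into batches of at most
# max_bytes each, preserving input order. Each batch would then be
# handed to megawarc for packing; meta WARCs are kept out of the input.

def batch_by_size(files, max_bytes=100 * 10**9):
    """files: iterable of (path, size_bytes) tuples. Returns list of path lists."""
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```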
11:11:35 | <@arkiver> | we'll focus on the blogs that had no activity for 3 years |
11:11:37 | <@arkiver> | or maybe 2 years |
11:19:34 | | Commander001 quits [Ping timeout: 258 seconds] |
11:19:59 | <masterx244|m> | found another useful detail on wplace.... |
11:19:59 | <masterx244|m> | https://github.com/murolem/wplace-archiver |
11:19:59 | <masterx244|m> | That guy noticed that there are ipv6 shenanigans doable to get more ratelimit per host |
11:22:47 | | Commander001 joins |
12:03:03 | | Dada joins |
12:04:30 | <b3nzo> | does grab-site crawl despite specifying --1? not for all but for a few urls, the warc files are huge, some wikipedia articles are over 500GB. and specifically "grab-site --1 https://www.radiofrance.fr/ecouter-musique" is around 7.6GB, and has a bunch of mp3, mp4 from the same host
12:53:48 | | nicolas17_ joins |
12:54:04 | | nicolas17 quits [Ping timeout: 260 seconds] |
13:07:26 | | nicolas17 joins |
13:08:39 | | nicolas17_ quits [Ping timeout: 260 seconds] |
13:37:11 | <@arkiver> | i have also posted the recent update in #archiveteam on opencollective |
13:40:58 | <@arkiver> | masterx244|m: is wplace something that is shutting down? |
13:47:48 | | HackMii quits [Remote host closed the connection] |
13:47:48 | | SootBector quits [Remote host closed the connection] |
13:48:10 | | HackMii (hacktheplanet) joins |
13:48:57 | | SootBector (SootBector) joins |
13:57:03 | | Oddly joins |
14:00:26 | <@arkiver> | let's make a channel for tistory |
14:00:30 | <@arkiver> | any ideas? |
14:01:05 | <@arkiver> | hexagonwin|m: not all posts on a tistory blog are in the sitemap, for example i don't see https://quizbang.tistory.com/3072 in the sitemap https://quizbang.tistory.com/sitemap.xml |
14:05:17 | <@imer> | just history is too simple I think, something with it is -> 'tis history? |
14:05:17 | <@imer> | "Behind The Name Tistory is a compound word consisting of T, the initial letter of Tattertools, and History. " #tatteredhistory? #tatteredstory? |
14:05:31 | | Oddly quits [Client Quit] |
14:07:08 | <@imer> | yeah that's all my creative juice used up I think |
14:10:08 | <@arkiver> | #tatteredstory is nice, nicely incorporates that story
14:10:11 | <@arkiver> | let's do that one |
14:10:21 | <@arkiver> | hexagonwin|m: FYI ^ |
14:10:31 | <@arkiver> | woah no JAA yet |
14:20:01 | <@arkiver> | hexagonwin|m: i see at the bottom of https://p.z80.kr/tistory_archiveteam.html you write about egloos. Archive Team did do a project for it as well, got 11 TB https://archive.org/details/archiveteam_egloos |
14:20:37 | <@arkiver> | but from the wiki page it looks like the site was only "partially saved" |
14:55:59 | | petrichor quits [Ping timeout: 260 seconds] |
14:58:42 | | petrichor (petrichor) joins |
15:08:13 | | Island joins |
15:26:02 | <h2ibot> | Arkiver uploaded File:Tistory-icon.png: https://wiki.archiveteam.org/?title=File%3ATistory-icon.png |
15:26:37 | | Karlett quits [Remote host closed the connection] |
15:33:17 | | cyanbox_ quits [Read error: Connection reset by peer] |
16:31:07 | <Hans5958> | <arkiver> "let's make a channel for tistory" <- There should be a one stop channel for anything blog services |
16:31:48 | <Hans5958> | I'd say #webroasting can be used, though it is from web hosting |
16:35:55 | <Hans5958> | Let's see if I would consistently call for using #webroasting for each web hosting that's going down in the future |
16:36:06 | <Hans5958> | I think it's been twice with this |
16:36:06 | <@arkiver> | Hans5958: i'm not sure about a one stop channel |
16:36:10 | <@arkiver> | the project can be very different |
16:36:36 | <@arkiver> | there could sure be a channel for more general coordination, but if we have a project for a certain service on a deadline, a dedicated temporary channel may be more fitting |
16:41:51 | <Hans5958> | I still think that every web hosting service shares the same features and will be handled in the same manner on AT, so that's why it would be nice to put it in one channel to keep those who are interested in these web hosts in one place, as well as having info in one place
16:42:43 | <Hans5958> | Though this is coming from me, who rarely uses the ephemeral nature of IRC, so I would agree to disagree on some motions
16:43:48 | <hexagonwin|m> | arkiver: yeah back when i was archiving egloos i've also chatted here, shared "some" stuff iirc on #eggos. i think archiveteam only got the rss feeds and website (before shutdown on 2023/06/16) but i got every post from known blogs by hammering their API long after shutdown (until about october)
16:44:10 | <katia> | Hans5958: generally each tracker project gets its channel; also sometimes other wishes-for-project gets their channel and sometimes these wishes are more broad |
16:44:22 | <katia> | i guess that’s what’s happening here |
17:36:46 | <masterx244|m> | arkiver: JAA mentioned wplace a while ago with an ArchiveBot crawl. it's a living thing that changes constantly, and if a crawl is pretty long it's not consistent at all. luckily some data dumps exist already. currently mirroring one that's stored on GitHub releases onto my own infra (might forward that to the IA as regular items once i get it synced)
17:41:30 | | Karlett (Karlett) joins |
17:42:14 | | @rewby quits [Ping timeout: 260 seconds] |
17:45:09 | | rewby (rewby) joins |
17:45:09 | | @ChanServ sets mode: +o rewby |
17:52:36 | <that_lurker> | easy channel name for tistory would be #thistory |
18:01:24 | | javascript17 joins |
18:02:00 | | HackMii quits [Ping timeout: 255 seconds] |
18:02:22 | | javascript17 quits [Client Quit] |
18:03:50 | | HackMii (hacktheplanet) joins |
18:07:47 | | javascript1 joins |
18:22:01 | | ATinySpaceMarine quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
18:22:28 | | ATinySpaceMarine joins |
18:26:55 | <lemuria> | Wplace? |
18:27:20 | <lemuria> | By the way the URL format for wplace tiles is https://backend.wplace.live/files/s0/tiles/X/Y.png where X and Y are numbers from 0-2048 |
18:27:25 | <lemuria> | 404 means tile is empty |
18:27:41 | <lemuria> | And it's zoom 11 |
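lemuria's tile scheme can be turned into a URL enumerator. The template is quoted from the chat; whether "0-2048" is endpoint-inclusive is not stated, so the default ranges below are an assumption.

```python
# Sketch: enumerate wplace tile URLs from the format given in the chat.
# A 404 response for a tile means that tile is empty.

TILE_URL = "https://backend.wplace.live/files/s0/tiles/{x}/{y}.png"

def tile_urls(xs=range(2048), ys=range(2048)):
    """Yield the URL of every tile in the given X/Y ranges."""
    for x in xs:
        for y in ys:
            yield TILE_URL.format(x=x, y=y)
```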
18:35:31 | <masterx244|m> | yeah but they got annoying rate limits. luckily they fucked up on the IPv6 end of those limits and limit by /128 and not /64. having a way to spread requests across a /64 bypasses that easily
18:36:13 | <masterx244|m> | (wrong buttflare config to our advantage since the tiles are cached on buttflare's HW) |
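The idea masterx244|m describes is that a rate limit keyed on the full /128 address is weak against a client that controls an entire /64, since it can rotate through the 2^64 host addresses within it. A minimal sketch of picking a random source address inside a /64; the prefix used is the IPv6 documentation range, purely illustrative.

```python
# Sketch: choose a random source address within an owned /64 prefix.
# 2001:db8::/64 is the documentation prefix, a stand-in for a real allocation.
import ipaddress
import random

net = ipaddress.IPv6Network("2001:db8::/64")

def random_source(rng=random) -> ipaddress.IPv6Address:
    """Random address in `net`: the /64 leaves 64 free host bits."""
    host = rng.getrandbits(64)
    return ipaddress.IPv6Address(int(net.network_address) + host)
```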
18:37:44 | <that_lurker> | Have they done a rug pull on the crypto yet? |
18:38:27 | <that_lurker> | or is it being archived proactively? |
18:40:17 | <masterx244|m> | was a proactive one afaik since the canvas evolves constantly |
18:45:30 | <b3nzo> | arkiver JAA: should i compress the megawarcs as gzip or zst? any preferred compression for the files |
18:46:30 | | emanuele6 quits [Read error: Connection reset by peer] |
18:47:35 | <masterx244|m> | zst requires pre-setup with a prepared dictionary and is more hassle for common users. .warc.gz is much easier to handle |
18:49:33 | <@JAA> | b3nzo: Why do you need to compress them? Are the input files not already compressed correctly? |
18:50:34 | <@JAA> | I don't believe megawarc lets you recompress with a different algorithm, and if there is any tooling out there, I'm not aware of and would be cautious about it. |
18:51:35 | <b3nzo> | not sure what you mean by input files (maybe single warc.gz files), they are compressed, but compressing thousands of files would be more efficient to store
18:52:26 | <masterx244|m> | WARC files are intentionally compressed per-file inside and not as a continuous stream. when replaying from WARCs you need to load files that might be far down the WARC without reading gigabytes of previous files |
18:52:32 | <@JAA> | per-record* |
18:53:22 | <@JAA> | You could compress whole for cold storage, I suppose, but yeah, for anything that's supposed to be accessible, it's not an option. |
18:53:40 | <@JAA> | (And for .warc.zst, the spec *requires* per-record compression.) |
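The per-record point can be demonstrated in a few lines: a .warc.gz is a sequence of independent gzip members, one per record, so a reader can seek straight to a member's byte offset and decompress only that record instead of streaming through everything before it. The two byte strings below stand in for WARC records.

```python
# Sketch: per-record compression as concatenated gzip members.
# Each record is its own gzip member; the members are simply concatenated.
import gzip

records = [b"record one", b"record two"]
members = [gzip.compress(r) for r in records]
stream = b"".join(members)  # this is the .warc.gz layout in miniature

# Seek directly to the second member's offset and decompress only it,
# without touching the bytes of the first record:
offset = len(members[0])
second = gzip.decompress(stream[offset:])
```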
18:54:20 | | emanuele6 (emanuele6) joins |
18:57:27 | <b3nzo> | ah, so even in case IA wants to index the archives, non-megawarc/single warc files are the way to go?
18:59:14 | <@JAA> | You can megawarc them, but the size should be exactly the sum of the small WARCs. A few bigger files are just much easier to manage than many small ones, both for you and for IA. |
19:00:29 | <b3nzo> | ah i see |
19:00:30 | <b3nzo> | thank you |
19:01:55 | <that_lurker> | always remember do not https://img.kuhaon.fun/u/tOxhpA.gif |
19:37:12 | | Rejoin_HP_Archivist joins |
19:40:39 | | HP_Archivist quits [Ping timeout: 260 seconds] |
19:53:13 | | makeworld4 is now known as makeworld |
20:01:54 | <h2ibot> | Pokechu22 edited Mailing Lists (+38, /* Software */ Sympa…): https://wiki.archiveteam.org/?diff=57413&oldid=57338 |
20:02:32 | <lemuria> | in random news, my router is definitely enjoying the upload speed exercise as i continue to archive what remains of nalani proctor's VODs |
20:03:13 | <lemuria> | deleting four years of VODs (and keeping the only copy on your backup hard drives; "your" referring to nalani's hard drives) is.. certainly a way to tear a hole in my historical record |
20:03:57 | <lemuria> | october 2024 complete. well, at least what remains of october 2024 |
20:46:44 | <Guest> | https://www.malwarebytes.com/blog/news/2025/08/national-public-data-returns-after-massive-social-security-number-leak |
20:46:55 | <Guest> | meant to post in #archiveteam-ot |
20:54:26 | | dabs joins |
21:09:41 | | dabs quits [Client Quit] |
21:12:35 | | atphoenix_ quits [Ping timeout: 258 seconds] |
21:13:44 | | atphoenix_ (atphoenix) joins |
21:39:48 | | emanuele6 quits [Read error: Connection reset by peer] |
21:47:46 | | emanuele6 (emanuele6) joins |
21:48:24 | | b3nz0 quits [Ping timeout: 260 seconds] |
21:50:27 | | etnguyen03 (etnguyen03) joins |
21:53:40 | | cyanbox joins |
22:17:09 | | atphoenix__ (atphoenix) joins |
22:19:19 | | atphoenix_ quits [Ping timeout: 260 seconds] |
22:19:58 | | Dada quits [Remote host closed the connection] |
22:23:44 | | BornOn420 (BornOn420) joins |
22:30:21 | | opl4 quits [Read error: Connection reset by peer] |
22:30:31 | | opl (opl) joins |
22:52:34 | | Church quits [Ping timeout: 260 seconds] |
23:11:27 | | etnguyen03 quits [Client Quit] |
23:17:58 | | Church (Church) joins |
23:31:34 | | HackMii quits [Remote host closed the connection] |
23:32:11 | | HackMii (hacktheplanet) joins |
23:42:44 | | luckcolors quits [Ping timeout: 260 seconds] |
23:47:43 | | nicolas17 is now authenticated as nicolas17 |
23:48:06 | | etnguyen03 (etnguyen03) joins |