00:17:25Urgo (Urgo) joins
00:39:30systwi_ (systwi) joins
00:39:32systwi quits [Ping timeout: 252 seconds]
00:42:27graham joins
00:44:12systwi_ is now known as systwi
00:49:13graham quits [Client Quit]
00:50:20etnguyen03 quits [Ping timeout: 252 seconds]
01:04:50etnguyen03 (etnguyen03) joins
01:15:50BlueMaxima joins
01:37:33TheLovinator quits [Remote host closed the connection]
01:38:19<pokechu22>Looking at deathwatch, it says http://www.egloos.com/ is shutting down on June 16. Has any project been started for that?
01:40:00<@JAA>OrIdow6 wrote code for it, now it's lying on arkiver's desk.
01:42:41yasomi joins
01:45:44yasomi quits [Changing host]
01:45:44yasomi (yasomi) joins
01:48:11yasomi quits [Client Quit]
01:48:23yasomi (yasomi) joins
02:14:49dumbgoy joins
02:16:28etnguyen03 quits [Ping timeout: 265 seconds]
02:26:25etnguyen03 (etnguyen03) joins
02:36:55TunaLobster quits [Read error: Connection reset by peer]
02:58:31decky_e quits [Remote host closed the connection]
02:59:02decky_e (decky_e) joins
03:11:20dumbgoy quits [Ping timeout: 252 seconds]
03:32:50skyrocket quits [Ping timeout: 265 seconds]
03:39:20skyrocket joins
03:40:50etnguyen03 quits [Ping timeout: 252 seconds]
03:45:35Megame quits [Client Quit]
04:19:08decky_e quits [Remote host closed the connection]
04:19:40decky_e (decky_e) joins
04:37:32decky_e quits [Remote host closed the connection]
04:44:59decky_e joins
04:45:44<pabs>hmm, viewer is down with 504 gateway timeout https://archive.fart.website/archivebot/viewer/
04:46:13<pabs>oh, its back
04:48:43dumbgoy joins
04:59:05BlueMaxima quits [Client Quit]
05:26:37hitgrr8 joins
05:32:08Barto quits [Ping timeout: 252 seconds]
06:01:05IDK (IDK) joins
06:09:26pabs quits [Ping timeout: 265 seconds]
06:12:02pabs (pabs) joins
06:14:45Braven quits [Ping timeout: 265 seconds]
06:27:54spirit quits [Client Quit]
06:36:23Island quits [Read error: Connection reset by peer]
06:47:47Ivan2261 joins
06:49:04Ivan226 quits [Ping timeout: 265 seconds]
06:51:05<h2ibot>PaulWise edited Mailman2 (+42, add lists.einval.com): https://wiki.archiveteam.org/?diff=49906&oldid=49897
07:00:02nfriedly quits [Remote host closed the connection]
07:00:06<h2ibot>PaulWise edited Mailman (+200, add forum-dl issue, link to Perceval too): https://wiki.archiveteam.org/?diff=49907&oldid=49895
07:00:12<pabs>mikolaj|m: ^
07:35:12beario joins
07:35:18beario_ joins
07:43:23JackThompson3 quits [Ping timeout: 252 seconds]
07:43:47nito-kihk joins
07:46:13JackThompson3 joins
08:07:50TastyWiener95 quits [Quit: Ping timeout (120 seconds)]
08:08:42TastyWiener95 (TastyWiener95) joins
08:41:38decky_e quits [Read error: Connection reset by peer]
08:44:10nito-kihk quits [Client Quit]
08:44:22nito-kihk (nito-kihk) joins
09:12:32decky_e (decky_e) joins
09:23:06<diggan>any chance one could get an account at gitea.arpa.li? Want to publish the source code for the automatic infrastructure setup I have for my warriors (Terraform + Prometheus + Grafana + AT Warrior spread across multiple clouds)
09:23:54nfriedly joins
09:25:00<diggan>^^ Fusl as you seem to be the website owner
10:00:01railen63 quits [Remote host closed the connection]
10:00:16railen63 joins
10:21:15dumbgoy quits [Ping timeout: 265 seconds]
10:35:20<@arkiver>diggan: i'm going to ping JAA on that too
10:35:30<@arkiver>diggan: nice you want to share it :)
10:45:49etnguyen03 (etnguyen03) joins
10:59:37etnguyen03 quits [Client Quit]
10:59:46drin joins
11:02:30geezabiscuit quits [Ping timeout: 252 seconds]
11:02:30drin is now known as geezabiscuit
11:05:33drin joins
11:09:13geezabis- joins
11:09:35geezabiscuit quits [Ping timeout: 265 seconds]
11:10:02geezabis- is now known as geezabiscuit
11:11:50drin quits [Ping timeout: 252 seconds]
11:19:10drin joins
11:21:00<h2ibot>KnowledgeKraze edited ArchiveBot/Educational institutions/list (+193): https://wiki.archiveteam.org/?diff=49908&oldid=49872
11:21:44geezabiscuit quits [Ping timeout: 252 seconds]
11:21:44drin is now known as geezabiscuit
11:26:59icedice (icedice) joins
11:28:29vegbrasil quits [Remote host closed the connection]
11:38:14Letur quits [Ping timeout: 252 seconds]
11:46:46decky_e quits [Remote host closed the connection]
11:49:42vegbrasil joins
11:59:41c3manu (c3manu) joins
12:01:54<diggan>arkiver thanks :) always happy to share
12:02:47<diggan>experimenting with a reddit-lite browser for the megawarcs as well, to have a better (reddit-like) interface instead of having to go through Wayback, would be nice to publish that with the rest of the AT stuff too
12:07:21<nito-kihk>diggan: wouldn't it be better to use the pushshift data for this ?
12:09:23<diggan>nito-kihk my understanding was that pushshift will no longer be updated
12:10:05<diggan>the latest version I found was 2023-03 I think
12:11:03<nito-kihk>ah yes true. yes it's the last version. I thought you meant for browsing historical data only
12:20:07<nito-kihk>i also wonder what will happen to the "https://www.reddit.com/r/all/comments.json" urls. because recording something like this would be easier than working with warcs
12:24:11etnguyen03 (etnguyen03) joins
12:30:42vegbrasi_ joins
12:34:20vegbrasil quits [Ping timeout: 252 seconds]
12:34:50icedice quits [Client Quit]
12:40:03icedice (icedice) joins
13:00:20vegbrasi_ quits [Remote host closed the connection]
13:00:49vegbrasil joins
13:03:21<cronfox>https://bcy.net/item/detail/7243752692219124791
13:03:21<cronfox>a Chinese "ACGN" community, "Half Dimension"
13:03:21<cronfox>shutting down its servers on 2023/07/12
13:10:25AmAnd0A quits [Ping timeout: 265 seconds]
13:10:47xkey quits [Client Quit]
13:11:08AmAnd0A joins
13:11:14xkey (xkey) joins
13:15:05diggan quits [Remote host closed the connection]
13:15:20diggan (diggan) joins
13:17:08Braven joins
13:46:33<masterX244>I wonder if it is doable to synthesize pushshift-compatible jsons out of the warcs
13:49:19<nito-kihk>well ultimately it's the same-ish data, so i think it should be possible. not necessarily easy :)
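(In the spirit of masterX244's question, a minimal sketch of what flattening a Reddit API object into a pushshift-style record might look like. The field names here are assumptions modeled loosely on the public pushshift comment dumps, not a verified schema, and `to_pushshift_like` is a hypothetical helper:)

```python
# Hypothetical sketch: flatten one "thing" from a Reddit .json listing
# (as captured in the WARCs) into a pushshift-style flat record.
# Field names are assumptions, not a verified pushshift schema.

def to_pushshift_like(thing):
    """Flatten a single t1_/t3_ object from a Reddit API listing."""
    data = thing["data"]
    kept = ("id", "author", "subreddit", "created_utc",
            "body", "score", "parent_id", "link_id")
    # Keep only the fields the dumps care about; drop everything else.
    return {k: data[k] for k in kept if k in data}

sample = {
    "kind": "t1",
    "data": {
        "id": "abc123",
        "author": "someone",
        "subreddit": "DataHoarder",
        "created_utc": 1686787200,
        "body": "example comment",
        "score": 5,
        "parent_id": "t3_xyz",
        "link_id": "t3_xyz",
        "extra_field_we_drop": True,
    },
}

record = to_pushshift_like(sample)
```

The actual work would of course be iterating the response records in the WARCs (e.g. with a WARC-reading library) and feeding each JSON body through something like this.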
13:54:53Letur joins
14:38:36<h2ibot>PaulWise edited Mailman2 (+42, add linuxmafia lists): https://wiki.archiveteam.org/?diff=49909&oldid=49906
14:45:16emberquill quits [Quit: The Lounge - https://thelounge.chat]
14:46:08emberquill (emberquill) joins
14:49:30Braven quits [Ping timeout: 265 seconds]
15:39:32phaeton joins
15:39:42phaeton leaves
15:40:36Island joins
16:02:50imer quits [Quit: Ping timeout (120 seconds)]
16:03:13imer (imer) joins
16:15:14<nicolas17>JAA: did you get origin.ka.cdn archived yet? I noticed two files were added in the last few days, cudos/activity/kpi/week/2023-06-11.csv cudos/activity/kpi/2023-06-11.csv
16:15:37<nicolas17>(and those filenames make me think it's internal metrics stuff that shouldn't be public :P)
16:16:42dumbgoy joins
16:20:44<fireonlive>key performance indicators eh :3
16:25:48<@JAA>diggan: Yes, soon, I'll have to do some things to Gitea tonight first.
16:26:04<@JAA>nicolas17: Not started yet, was waiting for some data to drain from my disks so I have enough space.
16:27:06<masterX244>nicolas17: what domain?
16:27:29<nicolas17>https://s3.amazonaws.com/origin.ka.cdn/
16:28:11<nicolas17>or one of the 3(!) different domains that have a CDN for that bucket :D
16:30:05<masterX244>small files, SPN can do that
16:30:51<masterX244>saved those 2 "should not exist" candidates
16:31:14<nicolas17>there's a total of 8TB in the bucket :)
16:32:21<imer>I could hold onto that for the time being, but I guess tooling is going to be the blocker again if warcs are wanted
16:32:56<@JAA>I have the space now, and there's time, so it doesn't even need to run very fast.
16:33:07<@JAA>It dedupes down to a bit over 3 TiB.
16:33:33<imer>ah, good. sounded like it'd be longer than a few minutes
16:33:34<masterX244>currently i'm short on space for large Crawls, too. got a TB of planetminecraft full onsite-mirroring (almost, got to repaginate a part to get some missing urls)
16:34:52<masterX244>(pagination WARCs will be "hidden" inside a tar or a zip when i upload since those are bad WARCs due to the shit they did at pagination where later pages (past 15k) are offset from the URL due to their anti-deeplink measure and 400 pages being lost to a error 500
16:35:26project10 joins
17:13:48yano quits [Quit: WeeChat, the better IRC client, https://weechat.org/]
17:15:16decky_e (decky_e) joins
17:16:44yano (yano) joins
17:52:56Twisty joins
17:53:06<Twisty>Hi guys
17:53:29<@arkiver>masterX244: feel free to upload to IA
17:53:48<Twisty>I was wondering, does the Youtube archive project archive links from all youtube videos in the long run?
17:55:00vegbrasi_ joins
17:56:22<Twisty>I, like so many others, have a bunch of links to removed YT videos and their details can only be found _sometimes_ by googling the url. I feel like it would only be natural to have these links and let's say, the video name and description archived. 🤔
17:58:00etnguyen03 quits [Ping timeout: 265 seconds]
17:58:29vegbrasil quits [Ping timeout: 265 seconds]
17:59:56vegbrasi_ quits [Ping timeout: 265 seconds]
18:08:21<masterX244>arkiver: wanted to get the entire batch ready, need to figure out an issue on the last step and some last loose ends need to be tied up, then it's ready
18:08:38decky_e quits [Ping timeout: 265 seconds]
18:08:44<masterX244>(i avoid uploading incomplete crawls while they are still in-progress)
18:08:55<pokechu22>Twisty: YouTube videos themselves, or links e.g. in their descriptions?
18:09:14decky_e (decky_e) joins
18:09:22vegbrasil joins
18:10:03<pokechu22>We want to eventually have a project to download youtube video metadata (and a separate project to download some videos, but that's a lot of data). The channel will be #down-the-tube, but I don't think there's been much progress on that project
18:10:24<pokechu22>In the mean time you can save google cache results and google search results at https://web.archive.org/save/
18:15:53vegbrasil quits [Ping timeout: 252 seconds]
18:23:20vegbrasil joins
18:28:32decky_e quits [Read error: Connection reset by peer]
18:29:04decky_e (decky_e) joins
18:38:36decky_e quits [Ping timeout: 265 seconds]
18:39:00decky_e (decky_e) joins
18:40:31decky_e quits [Remote host closed the connection]
18:46:16decky_e (decky_e) joins
18:51:08Barto (Barto) joins
18:56:19somerandomguy joins
18:57:45SuaveBet joins
19:01:52<somerandomguy>hello, i'm a newbie here. do you advise running the warrior from Ukraine, or is it considered a country with major censorship? and does it matter if i want to archive a site that's not banned, for example reddit? also, i have a shared IP, does that matter much?
19:10:31etnguyen03 (etnguyen03) joins
19:11:06Urgo_ joins
19:11:38Urgo quits [Ping timeout: 252 seconds]
19:23:17Ivan2261 is now known as Ivan226
19:30:26Urgo_ is now known as Urgo
19:30:39pabs quits [Read error: Connection reset by peer]
19:31:48<Twisty>pokechu22 Thanks, yeah that makes sense. How would I go about doing this? Right now I use the ArchiveTeam Warrior thing, and since Reddit seems to be taken care of now, I'd like to dedicate my time to that youtube details/metadata. I guess the Youtube project is my next best option, or the Google Sites?
19:32:33<pokechu22>I don't think there's an active project you can use with warrior for it yet. I'm also not entirely sure what the development status is (probably paused while dealing with the more pressing imgur and then reddit issues)
19:36:21<nstrom|m>Twisty: if you want to run something while reddit project is paused, telegram should work. there's plenty there to be done. I'd assume reddit will be back up soon
19:39:31<nstrom|m>a lot of the projects that show up in warrior aren't actively getting items fed into them even if they're not technically inactive
19:45:55AmAnd0A quits [Read error: Connection reset by peer]
19:46:12AmAnd0A joins
19:48:14<Twisty>True, I tried github and it didn't do anything, I assumed it's just _kinda done_. I'll do Telegram for now then. I'm curious, how would I go about searching for archived Telegram content? It's not like there are specific websites, since it's more like a chat.
19:56:41<Twisty>btw what's the status on Reddit? I thought it was done catching up, since the major bulk of "to do" was finished a couple of hours ago, but now there is this massive "out" portion I can't wrap my head around. Are "out" tasks already assigned to machines and now waiting for their data? If I select Reddit again, do I get to work on that big "out" portion, or would I only archive the most recent posts?
19:57:06<nicolas17>Twisty: reddit is currently paused while archiveteam admins adjust some stuff
19:57:35<Twisty>Well, I mean when it runs again :D
19:57:41<@JAA>#shreddit is the channel for that project.
19:57:56<nicolas17>"out" tasks were sent to a machine and now wait for their data, but it's possible it has been days and the machine will never return it, or it failed to download and dropped it, etc.
19:57:58<nicolas17>so yes, those get retried
19:58:46<nicolas17>or it failed because it was a private sub, so if we retry *now* it will fail again, so we'll need to do another round of retries when the strike ends anyway
19:59:45<Twisty>Right, so I won't be getting any of the "out" tasks, only the "to do"
20:00:00<nicolas17>if there's anything in todo, yes
20:00:17<Twisty>I see, thanks
20:00:18<nicolas17>afaik when "todo" runs out, the tracker automatically starts re-assigning the "out" tasks, oldest first
20:00:25<Twisty>ah
20:01:09<Twisty>I was wondering about that, so it's still a good option to keep running a project, even if it has 0 todo but 200mil out tasks
20:01:31<nicolas17>(imgur was a bit of a special case and admins had to manually move stuff from out into todo, or it would get stuck)
20:01:43<Twisty>Sounds like a nightmare
20:02:02decky_e quits [Ping timeout: 252 seconds]
20:02:19<nicolas17>yeah, the problem is that when a worker archived an image, it would also queue a thumbs:<id> item
20:02:37<Twisty>I don't know a whole lot about that but it surely doesnt sound good
20:02:45decky_e (decky_e) joins
20:02:45<nicolas17>and those were *not* being processed, so there was another process moving them from todo into a hidden queue (to be worked on later maybe)
20:03:07<Twisty>Is there a way to see what percentage of a site has been archived successfully? I guess it's difficult to pinpoint how large a site like reddit or imgur really is?
20:03:48<nicolas17>but until that move happened, if the tracker took a task from the database and it was thumbs:, it would ignore it and get another, and eventually give up... so it didn't get to see the todo empty in order to grab something from 'out' instead
20:04:40<Twisty>so basically ignore the actual thing that was supposed to be archived because the thumbnail was layered above, if that makes sense?
20:06:36<nicolas17>yeah, the thumbnails were being moved elsewhere, but until that happened they were on a higher priority queue than retries of old tasks
20:06:51<nicolas17>and that's not happening on reddit, so the reclaim of old tasks should work fine
20:08:02driib quits [Quit: The Lounge - https://thelounge.chat]
20:08:28<nstrom|m>yeah reddit project flows very smoothly without a ton of manual intervention
20:08:47driib (driib) joins
20:09:35<nstrom|m><Twisty> "True, I tried github and it didn..." <- see https://archive.org/details/archiveteam_telegram
20:09:54<nstrom|m>afaik we're archiving the 'preview' pages for public channels which are web accessible
20:11:23<nstrom|m>e.g. https://telegram.me/s/optom_tavarlar_turkiya for a channel, https://telegram.me/optom_tavarlar_turkiya/114771 for an individual post
20:11:36<nstrom|m>somewhat twitter like interface I guess. I've never really used telegram /shrug
20:11:52<nstrom|m>and all that stuff is feeding into the Wayback Machine
20:12:34<Twisty>i see, now i understand :D
20:14:25<Twisty>How does the archive determine how many tasks still need to be done? Did they index Telegram and come to the conclusion that there are still 600 mil groups unarchived?
20:15:02<nstrom|m>#telegrab:hackint.org for more questions on that project
20:15:58<nstrom|m>I know a lot of it was post discovery (basically enumerating through IDs) for public channels, and some is discovery of new channels from existing channels if they're referenced in a post
20:16:14<nstrom|m>fairly certain it's not an exhaustive list, a lot was manually targeted
20:16:52graham joins
20:17:05<nstrom|m>posts and channels are queued separately so a channel w a million posts counts as a million and one items in queue
20:17:09<Twisty>damn so there could be a lot more
20:17:42<nstrom|m>yeah believe so
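(The ID-enumeration idea nstrom|m mentions can be sketched simply: public Telegram posts sit at sequential URLs, so candidate post URLs are just a counter. The channel name below is the one from the example above; the range is arbitrary:)

```python
# Sketch of post discovery by ID enumeration: public Telegram channel
# posts are reachable at sequential web URLs, so candidates can be
# generated by counting up. Channel and range are placeholders.

def post_urls(channel, start_id, end_id):
    """Yield web-accessible URLs for a public channel's post ID range."""
    for post_id in range(start_id, end_id + 1):
        yield f"https://telegram.me/{channel}/{post_id}"

urls = list(post_urls("optom_tavarlar_turkiya", 114771, 114773))
```

A crawler would then fetch each URL and drop the IDs that 404 or redirect; gaps are expected since deleted posts leave holes in the sequence.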
20:21:03<Twisty>I did a brief google search a little while ago; Archiving via mobile isn't possible, right? I got a few old phones laying around that I could put to good use doing this sort of stuff instead of having my desktop machine consume 100x more power for the same task :D
20:21:45<nstrom|m>not currently. I know, would be nice
20:29:10katocala quits [Remote host closed the connection]
20:35:01c3manu quits [Client Quit]
20:35:09driib quits [Remote host closed the connection]
20:35:24jlwoodwa joins
20:35:29driib (driib) joins
20:40:21Catdurid joins
20:40:48pawbs joins
20:43:10moosetwin joins
20:45:11moosetwin quits [Remote host closed the connection]
20:46:14moosetwin joins
20:49:55driib quits [Remote host closed the connection]
20:50:14driib (driib) joins
20:54:31moosetwin quits [Remote host closed the connection]
20:58:10SuaveBet quits [Remote host closed the connection]
21:02:02katocala joins
21:08:02geezabiscuit quits [Ping timeout: 252 seconds]
21:09:49geezabiscuit (geezabiscuit) joins
21:12:06TheGleaner joins
21:12:27TheGleaner quits [Remote host closed the connection]
21:19:41somerandomguy quits [Remote host closed the connection]
21:34:51<h2ibot>Yts98 edited LINE BLOG (+139, Changed to upcoming): https://wiki.archiveteam.org/?diff=49910&oldid=49837
21:38:00Megame (Megame) joins
21:40:20etnguyen03 quits [Ping timeout: 265 seconds]
21:41:18Mateon1 quits [Ping timeout: 265 seconds]
21:46:50etnguyen03 (etnguyen03) joins
21:51:00Mateon1 joins
21:56:16decky_e quits [Ping timeout: 252 seconds]
21:56:37decky_e (decky_e) joins
22:00:04Hajdar quits [Remote host closed the connection]
22:00:22Hajdar (Hajdar) joins
22:02:24hitgrr8 quits [Client Quit]
22:11:32BlueMaxima joins
22:17:10em98 joins
22:17:45yts98 joins
22:18:12mtmustski joins
22:20:24em98 quits [Remote host closed the connection]
22:46:52jlwoodwa quits [Ping timeout: 252 seconds]
22:55:34<yts98>I discovered 1.2M blogs for lineblog.me: github.com/yts98/lineblog-items
22:55:42<yts98>Also, adding Japanese keyword items may help discover more blogs.
22:59:38railen63 quits [Remote host closed the connection]
23:02:35railen63 joins
23:03:20railen63 quits [Remote host closed the connection]
23:03:33<yts98>I just saw lineblog-grab created on AT's github. arkiver would you manage this project in the future?
23:03:33railen63 joins
23:09:22SebasTheCrab joins
23:09:54SebasTheCrab leaves
23:12:01railen63 quits [Remote host closed the connection]
23:12:28railen63 joins
23:17:12MactasticMendez joins
23:18:54pabs (pabs) joins
23:24:15MactasticMendez quits [Client Quit]
23:24:24MactasticMendez (MactasticMendez) joins