00:03:49nexusxe (nexusxe) joins
00:13:08decky_e quits [Ping timeout: 252 seconds]
00:13:33decky_e (decky_e) joins
00:30:08nexusxe quits [Client Quit]
00:41:01sonick (sonick) joins
01:14:59cascode joins
01:21:02<joepie91|m>people being their usual fascist selves, as far as I can tell
01:22:24<joepie91|m>Elon found a new thing to be conspiratorial about (the Internet Archive) and his followership has latched on, with the not insignificant proportion of fascists among them having interpreted it as an invitation to think up antisemitic conspiracy theories, spread transphobic rhetoric, etc.
01:22:38<joepie91|m>this is a fairly clockwork thing with Elon unfortunately
01:23:06<joepie91|m>just this time it's IA in the crosshairs
01:24:00<JTL>Not like IA doesn't have enough problems already :x
01:25:32<joepie91|m>yeah...
01:26:13joepie91|m mumbles something about fascists torching libraries historically
01:46:00dumbgoy_ joins
01:49:23dumbgoy quits [Ping timeout: 252 seconds]
02:08:39<TheTechRobo>Love the Elon stans in the replies. People say that anyone can exclude their own content from IA, then a bunch of people ask "WELL THEN WHY WAS THE CONTENT REMOVED"
02:09:04<TheTechRobo>How have humans survived this long?
02:17:23<nicolas17>TheTechRobo: I bet the stans also said Elon was 100% right in firing the people he fired and that Twitter is best off without them
02:17:34<nicolas17>I'd like to see them justify this now https://www.businessinsider.com/elon-musk-says-twitter-probably-rehire-some-laid-off-staff-2023-5
02:22:56TheTechRobo quits [Ping timeout: 252 seconds]
02:24:02TunaLobster joins
02:24:18TheTechRobo (TheTechRobo) joins
02:24:25cascode quits [Ping timeout: 265 seconds]
02:25:52pabs quits [Ping timeout: 265 seconds]
02:26:31pabs (pabs) joins
02:27:54cascode joins
02:47:56seefoe joins
02:49:05decky_e quits [Remote host closed the connection]
02:49:38decky_e (decky_e) joins
02:50:16<seefoe>Hey all, just wanted to check on the status of the warrior tracker for the imgur project. My warrior instance has been reporting "No HTTP response received from tracker. The tracker is probably overloaded." for the past 45 mins
02:50:18cascode quits [Read error: Connection reset by peer]
02:50:39cascode joins
02:52:12<seefoe>Looks like the leader-boards aren't loading for me either
02:53:48<andrew>seefoe: the tracker being down is known, but admins are asleep
02:54:06<seefoe>ahhh, okay, glad to hear it isn't just me -- thanks andrew
03:00:02seefoe quits [Client Quit]
03:00:18seefoe joins
03:01:49<icedice><TheTechRobo> Love the Elon stans in the replies. People say that anyone can exclude their own content from IA, then a bunch of people ask "WELL THEN WHY WAS THE CONTENT REMOVED"
03:03:07<icedice>Seems on brand. These are the people who go >muh freedumb of speech whenever some community tells them to fuck off, and meanwhile they think that getting rid of Section 230, so that platforms would be liable for everything users post, is a great idea to get back at big tech
03:03:46<icedice>They don't think far enough ahead to realize that would actually mean a lot more censorship, and not of the kind that is actually warranted
03:04:25<icedice>And that they'd be the first people who'd get fucked over by it since they're the biggest assholes around that can't avoid breaking rules
03:05:10<icedice>Thinking isn't exactly their strong suit
03:05:52<icedice>And Elon the man-child will do whatever he thinks will please his new fanbase
03:06:07<andrew>it's so unbelievably stupid, with these fascist influencers twisting themselves into pretzels calling Elon's claims "true"
03:06:15<andrew>(in the replies)
03:07:26dumbgoy_ quits [Ping timeout: 265 seconds]
03:13:14<icedice>https://www.memeatlas.com/images/brainlets/brainlet-crew-colorful.jpg
03:13:30<icedice>^ Visual representation of Elon and his fanbase
03:23:07<pokechu22>It's odd that the archivebot control node and the warrior tracker for #imgone both died
03:26:34katocala quits [Ping timeout: 252 seconds]
03:27:12katocala joins
03:34:04BlueMaxima joins
04:06:33<@arkiver>Elon Musk spread a conspiracy theory that has unfortunately been around for some time.
04:06:48<@arkiver>I would like to refer to https://twitter.com/brewster_kahle/status/1659283393753006082
04:07:59<@arkiver>I also want to ask everyone to please not have political discussions here. Archive Team is not the place for that.
04:10:34cascode quits [Read error: Connection reset by peer]
04:10:42cascode joins
04:11:01cascode quits [Read error: Connection reset by peer]
04:11:23cascode joins
04:15:38eroc19901 (eroc1990) joins
04:15:41eroc1990 quits [Ping timeout: 252 seconds]
04:20:06cascode quits [Ping timeout: 252 seconds]
04:22:47cascode joins
04:36:47decky_e quits [Remote host closed the connection]
04:37:12decky_e joins
04:46:08tsblock (tsblock) joins
05:04:32archivist99 joins
05:04:38GNU_world quits [Ping timeout: 252 seconds]
05:06:30archivist99 is now known as GNU_world
05:23:20Island_ quits [Read error: Connection reset by peer]
05:30:29<seefoe>looks like it's back up
05:33:04cascode quits [Ping timeout: 252 seconds]
05:35:56archivist99 joins
05:35:57fullpwnmedia quits [Read error: Connection reset by peer]
05:36:11fullpwnmedia joins
05:36:18GNU_world quits [Ping timeout: 265 seconds]
05:51:27cascode joins
06:01:46Unholy2361 quits [Quit: Ping timeout (120 seconds)]
06:05:01archivist99 is now known as GNU_world
06:13:20hitgrr8 joins
06:43:48killsushi joins
06:44:12seefoe quits [Ping timeout: 252 seconds]
06:45:07seefoe joins
06:54:55BlueMaxima quits [Client Quit]
06:55:12cascode quits [Ping timeout: 252 seconds]
06:55:46cascode joins
07:13:32cascode quits [Ping timeout: 252 seconds]
07:14:31spirit quits [Client Quit]
08:12:20<Ryz>Huh...? Just the English and Chinese language versions of Niconico are going to be discontinued at the end of June 2023? https://blog.nicovideo.jp/niconews/192660.html
08:12:30<Ryz>Might need investigating if there's anything more than that ><;
08:14:02<Exorcism|m>whaaaat, but what is the point of removing translations 😭
08:36:58spirit joins
08:51:03decky_e quits [Read error: Connection reset by peer]
08:57:13decky_e (decky_e) joins
09:23:29Minkafighter quits [Quit: The Lounge - https://thelounge.chat]
09:23:42Minkafighter joins
09:54:52umgr036 joins
09:55:43umgr036 quits [Remote host closed the connection]
09:55:56umgr036 joins
10:10:39user__ joins
10:13:15umgr036 quits [Ping timeout: 265 seconds]
10:34:23tsblock quits [Read error: Connection reset by peer]
10:36:57decky_e quits [Remote host closed the connection]
10:40:45s-crypt quits [Remote host closed the connection]
10:40:45flashfire42 quits [Remote host closed the connection]
10:40:45kiska quits [Remote host closed the connection]
10:40:45Ryz2 quits [Remote host closed the connection]
10:41:04Ryz2 (Ryz) joins
10:41:06s-crypt (s-crypt) joins
10:41:09flashfire42 (flashfire42) joins
10:42:18kiska (kiska) joins
10:57:38BearFortress quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
11:02:06<icedice>Right, sorry arkiver
11:02:56<icedice>Musk just annoys the hell out of me, especially when going after organizations like Internet Archive that just try to make things better
11:03:13<icedice>I'll leave it at that
11:08:14BearFortress joins
11:09:33<icedice>Does the Imgur API still allow grabbing 12,500 downloads per day and 500 downloads per hour, or have they limited that recently?
11:14:10tbc1887_ quits [Client Quit]
11:14:30tbc1887 (tbc1887) joins
11:22:47<@OrIdow6>Anyone looked into Splice Studio (on DW, the 31st)? I think that section of the website is login-restricted but I'm not entirely sure
11:42:11<h2ibot>TheTechRobo edited Strawpoll.me (+19, Link to my uploaded data): https://wiki.archiveteam.org/?diff=49804&oldid=49414
12:21:29spirit quits [Client Quit]
12:39:19sonick quits [Client Quit]
13:24:48<Thibaultmol>so which warrior/grab projects are currently actively getting jobs?
13:24:48<Thibaultmol>imgur, telegram, reddit, urls
13:24:48<Thibaultmol>Am I missing any?
13:29:29lennier1 quits [Ping timeout: 265 seconds]
13:30:33lennier1 (lennier1) joins
13:48:04Unholy2361 (Unholy2361) joins
13:49:35umgr036 joins
13:49:53user__ quits [Ping timeout: 252 seconds]
14:03:02<nstrom|m>urlteam is kinda always active but probably has way more workers than work
14:08:15<icedice>What's going on with the ArchiveBot tracker being down?
14:41:14Arcorann quits [Ping timeout: 252 seconds]
15:04:03spirit joins
15:23:07<@kaz>not a lot
15:23:13<@kaz>is down, will be back in the future no doubt
15:24:06Unholy23614 (Unholy2361) joins
15:27:47Unholy2361 quits [Ping timeout: 252 seconds]
15:27:47Unholy23614 is now known as Unholy2361
15:31:19Island joins
15:58:10BigBrain_ (bigbrain) joins
15:58:21BigBrain quits [Ping timeout: 245 seconds]
16:06:02<icedice>Ok
16:06:03icedice quits [Client Quit]
16:28:41eroc19901 is now known as eroc1990
16:45:53jspiros quits [Ping timeout: 252 seconds]
16:46:47jspiros (jspiros) joins
17:01:01dumbgoy_ joins
17:31:46decky_e (decky_e) joins
17:35:50tbc1887 quits [Read error: Connection reset by peer]
17:43:41<mikolaj|m>I've uploaded forum-dl v0.1.0 to PyPI (can be installed with pip install forum-dl now)
17:44:58<mikolaj|m>if anyone's interested in testing it and leaving a comment, feature request, or a bug report -- I'd appreciate that greatly
17:45:08<mikolaj|m>the repository is here, of course: https://github.com/mikwielgus/forum-dl
17:45:45<mikolaj|m>it cannot dump WARCs yet, unfortunately; that's a priority for v0.2.0
17:46:47that_lurker quits [Client Quit]
17:47:40that_lurker (that_lurker) joins
17:48:58decky_e quits [Ping timeout: 252 seconds]
17:49:23decky_e (decky_e) joins
18:10:39Miori joins
19:11:01HP_Archivist (HP_Archivist) joins
19:12:39decky_e quits [Ping timeout: 265 seconds]
19:18:57decky_e (decky_e) joins
19:38:34<andrew>is there any way to have grab-site re-crawl some section of a site without re-downloading other pages that have already been downloaded?
19:43:43<@JAA>andrew: Depends on your definition of 'section'. If it's a certain path, the easiest way would just be to run a separate crawl starting from that. If it's less easily separated, a full crawl with broad ignores might work. You could also mess with the wpull DB to mark URLs as todo again and then resume the crawl, but you might want to do that in a separate dir with a copy of the data.
19:46:24<andrew>JAA: wait, you can resume a crawl?
19:47:45<@JAA>Well, sort of, but not properly.
19:48:46<@JAA>You can manually run wpull (using the command generated by grab-site with some option), but it has problems. Cookies will be blank for the restart, for one.
19:49:42<@JAA>Not sure if grab-site makes wpull write out cookies when it finishes normally nor what happens if you stop a crawl.
19:49:59<@JAA>Suffice to say that it's a messy process.
19:59:16<andrew>JAA: how badly will grab-site barf if I give it a 50 million line ignore list?
20:00:26<@JAA>andrew: Not sure if re2 has any pattern size limits, but other than that, you'd just be limited by RAM size I guess. I bet the performance would be abysmal though.
20:00:34<andrew>or would it be better to start a crawl then import the URL list into wpull.db
20:03:05<@JAA>I suppose that would be another option, yeah.
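[ed.: JAA's suggestion above — messing with the wpull DB to mark URLs as todo again — could look roughly like the sketch below. This is a hedged illustration, not wpull's documented API: the `urls` table and the `status` values are assumptions about wpull's SQLite schema, which varies between versions; inspect your own wpull.db with `.schema` first, and only ever touch a copy while the crawl is stopped (as the traceback further down demonstrates the hard way).]

```python
import sqlite3

def requeue_section(con: sqlite3.Connection, prefix: str) -> int:
    """Mark every already-fetched URL under `prefix` as 'todo' again.

    NOTE: the `urls` table and the 'done'/'todo' status values are
    assumptions about wpull's SQLite schema, not a guaranteed interface.
    Run this only against a COPY of wpull.db while the crawl is stopped.
    """
    with con:  # commit on success, roll back on error
        cur = con.execute(
            "UPDATE urls SET status = 'todo' "
            "WHERE status = 'done' AND url LIKE ? || '%'",
            (prefix,),
        )
    return cur.rowcount

# Toy demo against an in-memory DB using the assumed schema:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT)")
con.executemany("INSERT INTO urls VALUES (?, ?)", [
    ("https://example.com/section/a", "done"),
    ("https://example.com/section/b", "done"),
    ("https://example.com/other/c", "done"),
])
print(requeue_section(con, "https://example.com/section/"))  # prints 2
```

After the update, resuming the crawl (with the caveats JAA mentions about cookies) should make wpull revisit only the requeued section.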
20:04:25<andrew>ERROR Fatal exception.
20:04:25<andrew>Traceback (most recent call last):
20:04:25<andrew> File "/nix/store/sgpdf09ql5ik9zvdw8dgxplqgp5k02h9-python3.8-SQLAlchemy-1.3.24/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
20:04:25<andrew> self.dialect.do_execute(
20:04:25<andrew> File "/nix/store/sgpdf09ql5ik9zvdw8dgxplqgp5k02h9-python3.8-SQLAlchemy-1.3.24/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
20:04:26<andrew> cursor.execute(statement, parameters)
20:04:26<andrew>sqlite3.DatabaseError: database disk image is malformed
20:04:42<andrew>oops, I probably shouldn't have tried poking around the database while it was running
20:05:44umgr036 quits [Remote host closed the connection]
20:05:58umgr036 joins
20:06:13<andrew>well shit, what do I do now to resume this crawl? :P
20:13:19<@JAA>RIP
20:13:51<andrew>welp, I guess it's time to see what re2 can do :P
20:14:01<andrew>WCGW?
20:59:15<andrew>update: it seems re2 does not like my 754k line ignores file, it has printed "/build/source/re2/simplify.cc:225: CoalesceWalker::ShortVisit called" to the console thousands of times
21:00:07<@JAA><surprised_pikachu.png>
21:00:35<andrew>/ Should never be called: we use Walk(), not WalkExponential().
21:00:39<andrew>well that's comforting
21:00:42<@JAA>Yeah, just saw that as well. lol
21:00:47Unholy2361 quits [Client Quit]
21:01:08Unholy2361 (Unholy2361) joins
21:04:34<andrew>great, now Sublime Text (UNREGISTERED) is not responding
21:07:48<joepie91|m>arkiver: honestly, I don't think that "don't have political discussions" is the right policy here. what Archive Team does is fundamentally political, and political circumstances affect both Archive Team and the Internet Archive (as is obvious from this case of Musk targeting IA, for example). this has also historically been true, with eg. institutions of knowledge and information often being the first to go under authoritarian rule. I can understand that certain rhetoric, behaviour or viewpoints may not be welcome here, but "no political discussions" seems to me an extremely poor capture of that idea, and one that indirectly makes it impossible to discuss the very real threats that affect not just IA but also the people here.
21:08:21<joepie91|m>if there's certain rhetoric/views/etc. that are not wanted here, then I would suggest calling those by their name instead
21:08:48<joepie91|m>(I'm not opposed to that, to be clear - just arguing that trying to phrase it as "no politics" has unwanted collateral effects)
21:12:58<fireonlive>754k!
21:13:17<fireonlive>i feel better about my 42 rules now
21:13:29<andrew>surprisingly, Notepad++ is handling this much more gracefully
21:16:22<masterX244>when you stomp onto unexpected bugs due to oversized data
21:17:08<andrew>Notepad++ has some graphical glitches with the length of these lines :P
21:18:32<andrew>keystroke latency is a perfectly fine 7 seconds
21:18:38TunaLobster quits [Read error: Connection reset by peer]
21:19:44<andrew>okay, I optimized the regex down to 30 megabytes
21:19:49<andrew>let's see if it works
21:20:36<andrew>okay, turns out terminals don't like it when you print 18 megabytes on a single line
21:20:52<CrispyAlice2>who woulda thought
21:21:10<threedeeitguy>OoO
21:22:37<fireonlive>mikolaj|m: very interesting, thank you!
21:22:37<andrew>grab-site quickly hit 3.1 GiB of memory
21:22:52<andrew>oh come on - /build/source/re2/simplify.cc:225: CoalesceWalker::ShortVisit called
21:22:56<masterX244>seems like the bot is still munching
21:23:20<masterX244>wrong chat, shit
21:24:19<andrew>hey maybe it worked, wpull.db is filling up
21:27:10<joepie91|m>hm, oops, apparently the aforementioned topic had moved to -ot and I didn't notice
21:28:20Billy549 (Billy549) joins
21:35:39seefoe quits [Client Quit]
21:48:43seefoe joins
22:01:38<andrew>okay, wow, this regex performs so poorly the HTTP requests seem to be timing out (?)
22:02:09<andrew>this is not good
22:03:15<spirit>maybe there are some tools to build more efficient expressions from a big corpus like that
22:03:24<spirit>if there is any chance :D
22:03:28<@JAA>I was wondering how long it would take to match even a single URL.
22:03:30<andrew>oh, what if I passed --exclude-directories to wpull?
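[ed.: when an "ignore list" is really hundreds of thousands of literal URLs or directory prefixes rather than true patterns, a hash-set lookup with a walk up the path is far cheaper than compiling one giant alternation into re2. The sketch below only illustrates the idea standalone; it is not grab-site's actual ignore mechanism, and wiring it into grab-site would mean patching it, which is not shown.]

```python
# Illustrative alternative to a 754k-alternative regex: store literal
# ignore entries in a set and check a URL and its parent directories.
from urllib.parse import urlsplit

def build_ignores(lines):
    """Load literal ignore entries (full URLs or dir prefixes) into a set."""
    return {line.strip() for line in lines if line.strip()}

def is_ignored(url: str, ignores: set) -> bool:
    """True if the URL, or any parent directory of it, is in the ignore set."""
    if url in ignores:
        return True
    parts = urlsplit(url)
    path = parts.path
    # Walk up the directory tree: /a/b/c -> /a/b/ -> /a/ -> /
    while "/" in path.rstrip("/"):
        path = path.rstrip("/").rsplit("/", 1)[0] + "/"
        if f"{parts.scheme}://{parts.netloc}{path}" in ignores:
            return True
        if path == "/":
            break
    return False

ignores = build_ignores(["https://example.com/junk/"])
print(is_ignored("https://example.com/junk/page/1", ignores))  # True
print(is_ignored("https://example.com/keep/page/1", ignores))  # False
```

Each lookup is O(path depth) set membership tests, independent of how many entries the ignore list has, which is why this scales where a monolithic regex does not.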
22:19:48<fireonlive>are transfer.archivete.am files expired after a while?
22:20:40seefoe is now known as rohvani
22:21:13<@JAA>fireonlive: No
22:21:24<fireonlive>kk :)
22:21:45<fireonlive>this is archiveteam afterall i suppose haha
22:22:03<masterX244>we don't advertise it to avoid abuse
22:22:45<@JAA>Yeah, the homepage also mentions a size limit that doesn't exist.
22:23:32<imer>i call bs on that one JAA, i may have accidentally tried to upload non-zstd files and it timed out :p
22:24:15<imer>although that might be peering limitations, can't seem to get it to upload faster than ~50mbit/s
22:24:17<pokechu22>oh, yeah, the front page claims "Files stored for 1 day" which is definitely not the case
22:24:46<pokechu22>I also like the "contact us" link that goes nowhere
22:24:50<@JAA>imer: Yeah, but that's a bug with the CDN, not an actual restriction.
22:25:11<imer>well, same difference
22:25:35<@JAA>There is a workaround that allows uploading arbitrarily large files. It's just not advertised publicly.
22:25:42<imer>probably a good thing
22:33:05hitgrr8 quits [Client Quit]
22:34:49<fireonlive>makes sense
22:37:10killsushi quits [Ping timeout: 252 seconds]
23:04:09BlueMaxima joins
23:24:18<fullpwnmedia>nicolas17 is there actually a random video on the server?
23:25:15<fullpwnmedia>JAA how'd you go with the pulling from the dynabook ftp?
23:26:37<@JAA>fullpwnmedia: I just threw it into ArchiveBot. It should be in the Wayback Machine by now.
23:26:53<fullpwnmedia>all of it?
23:27:13<nicolas17>fullpwnmedia: Upload/74580960.mp4
23:27:22<nicolas17>added may 16
23:27:25<@JAA>Is something missing?
23:27:37<nicolas17>it's the only change since I first mirrored it
23:28:04<fullpwnmedia>JAA it should be good. thanks for that
23:28:16<fullpwnmedia>nicolas17 whats the ftp url again?
23:28:38<nicolas17>https://uk.dynabook.com/generic/general-new-ftp-and-software-guide-sheets/
23:29:10<nicolas17>here's the video https://transfer.archivete.am/LMo8u/74580960.mp4
23:29:47<fullpwnmedia>istg if its some nsfw shit
23:30:00<fullpwnmedia>nvm lmao
23:30:18<fullpwnmedia>thats the most random video ive seen in a while
23:30:30<nicolas17>it's actually a dynabook laptop tho
23:30:48<fullpwnmedia>it is but its funny that someone just put it there
23:31:08<fullpwnmedia>i wonder if it was done by support
23:33:33<fullpwnmedia>JAA sorry, how do i access the files on the wayback like the directory list? do i just put the same user and password on the prompt on wayback?
23:35:06<@JAA>fullpwnmedia: You can't really access the dir listing. Username and password are ignored by the WBM (as is the protocol, actually). But https://web.archive.org/web/20230507*/ftp://www.tb2b.eu/* should help.
23:35:39<fullpwnmedia>gotcha gotcha. so all i need is just the file path
23:36:28<@JAA>The dir listing would in theory be available at https://web.archive.org/web/20230507035244/ftp://www.tb2b.eu/ but the WBM forces it to a download instead of displaying the contents.
23:36:37<nicolas17>I have 6973 files+directories here
23:36:57<fullpwnmedia>fair enough
23:36:59<nicolas17>that WBM page says there are 6977 captured URLs so it should be complete, but I'm left wondering what the 5 extra are
23:37:07<nicolas17>er 4 extra I can't math today
23:37:16<pokechu22>view-source:https://web.archive.org/web/20230507035244id_/ftp://www.tb2b.eu/
23:37:52<fullpwnmedia>nicolas17 could it be the html for the directory listings?
23:38:04<pokechu22>at least in firefox, that bypasses it being downloaded
23:38:07<@JAA>Oh right, view-source bypasses the download forcing.
23:38:20<nicolas17>well I'm counting directories too
23:38:28<@JAA>More accurately, the WBM doesn't actually force anything, it just serves it with 'content-type: application/octet-stream'.
23:38:33<andrew>so update on the grab-site exclusion thing - I found the source code that handles --exclude-directories and it looks prohibitively slow for the number of exclusions I have
23:38:37<@JAA>And the browser generally won't know what to do with that.
23:38:48<pokechu22>but you also need to change the url to specify id_ or if_ because otherwise you just get the source for the WBM outer frame: view-source:https://web.archive.org/web/20230507035245/ftp://www.tb2b.eu/Deployment_Files/
23:38:59<fullpwnmedia>youre a real one for that
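[ed.: the id_ trick above is easy to script: inserting `id_` after the timestamp in a Wayback Machine URL asks for the raw capture bytes instead of the framed replay page, which also works for FTP captures like the one discussed here. A minimal helper (the example URL is the dir listing from this conversation):]

```python
# Build a Wayback Machine URL that returns the raw capture bytes
# ("id_" replay modifier) rather than the WBM-framed page.
def wbm_raw_url(timestamp: str, original: str) -> str:
    return f"https://web.archive.org/web/{timestamp}id_/{original}"

print(wbm_raw_url("20230507035244", "ftp://www.tb2b.eu/"))
# https://web.archive.org/web/20230507035244id_/ftp://www.tb2b.eu/
```

Fetching that URL with any HTTP client then yields the stored directory listing directly, no view-source workaround needed.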
23:49:02Nulo_ joins
23:51:44Nulo_ quits [Read error: Connection reset by peer]
23:51:49Nulo_ joins
23:52:01Nulo quits [Ping timeout: 265 seconds]
23:52:02Nulo_ is now known as Nulo