00:02:07etnguyen03 (etnguyen03) joins
00:09:55loug83181422 quits [Quit: The Lounge - https://thelounge.chat]
00:19:25<aeg>has anyone archived the splits.io database?
00:23:42<pabs>"We will continue with our original shutdown plans. Splits.io's last day online will be March 31, 2025."
00:24:44<aeg>sure, but has anyone archived the database?
00:26:59pixel (pixel) joins
00:29:30<pabs>doesn't seem publicly available?
00:33:39<aeg>presumably it would need to be scraped through the public api
00:34:27<pabs>nothing in AB yet https://archive.fart.website/archivebot/viewer/?q=splits.io
00:35:29<pabs>if you have a URL list, we can run it there.
00:35:44<pabs>it's also in Deathwatch, so someone might look nearer to the deadline, not sure if they would do the API too
00:37:29<aeg>doesn't archivebot archive only websites, crawler-style? i don't think that would be the right way to archive splits.io, it is a database-backed site
00:39:18<pabs>it can do either crawls, or individual URLs, or lists of URLs (for eg every user in an API), or crawl a list of sites
00:40:06<pabs>if the database is public, we can download it with AB. if the data needs API calls, we can do those calls with the URL list option
00:40:43<pabs>I don't know enough about the site to figure out how to save the data tho
00:44:10<aeg>i don't know how many users splits.io has. speedrun.com has 2.3 million users; let's say splits.io has somewhere between 100k and 2 million. would archivebot take a list of that many urls?
00:45:04<aeg>each user would have numerous runs, but when i poked at the api a few weeks ago, it looked like the user endpoint returned all the associated run data
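[A URL list for such a job could be generated with a short script. A minimal sketch; the `/api/v4/users/{name}` path is an assumption, not a confirmed splits.io route, so check the site's real API docs before submitting a job:]

```python
# Sketch of building a per-user API URL list for an ArchiveBot URL-list job.
# The /api/v4/users/{name} endpoint below is an assumed, illustrative path.
API_BASE = "https://splits.io/api/v4"

def user_api_urls(usernames):
    """Return one API URL per user, ready to be written one per line."""
    return [f"{API_BASE}/users/{name}" for name in usernames]

# usage: write the list to a descriptively named file for upload
# with open("splits.io-all-user-api-urls.txt", "w") as f:
#     f.writelines(url + "\n" for url in user_api_urls(all_usernames))
```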
00:48:26<pabs>yeah, 100-200k is easy
00:48:39<pokechu22>It depends on if they have any rate-limiting
00:48:45<pabs>2mil should be fine too
00:49:05<pabs>yeah, especially with the deadline
00:53:20lennier2 quits [Client Quit]
00:57:51lennier1 (lennier1) joins
01:01:28th3z0l4_ joins
01:03:32th3z0l4 quits [Ping timeout: 250 seconds]
01:06:33<aeg>so then archivebot eventually uploads a warc to archive.org? how do you do quality control on the scrape results?
01:07:36<pabs>we watch the job for errors etc on http://archivebot.com/
01:07:59<pabs>and via other monitoring https://wiki.archiveteam.org/index.php/ArchiveBot/Monitoring
01:08:16<pabs>and yeah, everything ends up in web.archive.org
01:08:39<pabs>and the warcs are on archive.org
01:09:30<pabs>folks can then extract data from either the WBM/CDX, or download/parse the warcs https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
01:10:22<pabs>https://archive.fart.website/archivebot/viewer/ is handy for finding the warc files for a job
01:11:02gust quits [Read error: Connection reset by peer]
01:11:03<aeg>what's the turnaround time from job initiation to warc availability?
01:11:44<nicolas17>unknown
01:11:53<pokechu22>Generally less than a day right now
01:11:57<pabs>usually a few days though, depending on the data size
01:12:40<pabs>(longer when IA ingestion is slow/stuck)
01:12:50<nicolas17>this reminds me...
01:13:12<pokechu22>I generally download pagination locally and then save the list of URLs I downloaded, as well as page contents from that pagination, via archivebot, rather than generating a list of pagination URLs, saving them, and then downloading the WARC and generating a list from that
01:13:16<pokechu22>but it depends on the site
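[The "paginate locally, then hand the item URLs to ArchiveBot" workflow described above can be sketched as follows. The JSON shape (an `items` array and a `next` link) is a hypothetical stand-in, not splits.io's real pagination format; `fetch` is whatever function retrieves and parses one page:]

```python
def collect_item_urls(first_page_url, fetch):
    """Walk pagination locally, returning every item URL encountered.

    fetch(page_url) must return the page parsed as a dict; the
    "items"/"next" keys are assumed, hypothetical field names.
    """
    urls, page = [], first_page_url
    while page:
        data = fetch(page)
        urls.extend(item["url"] for item in data.get("items", []))
        page = data.get("next")  # falsy when pagination ends
    return urls
```

[The collected list, plus the saved page contents, is then what gets fed to ArchiveBot.]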
01:14:15<pabs>I'd !a < the pagination URLs :) lazy route
01:14:27<pabs>not always possible though
01:14:46<pabs>btw pokechu22 did you document that sitemap trick somewhere? it definitely works?
01:14:49<pokechu22>Yeah, that's better if it works (but --no-parent and archivebot not liking to extract stuff from json make that a problem)
01:14:56<pokechu22>It definitely works, haven't really documented it anywhere though
01:15:15<nicolas17>URLs that I saved via archivebot on nov 3, nov 5, nov 7, and nov 11 (2024) are *still* not up on IA, I guess the outage backlog is still being uploaded super slowly?
01:15:29<pabs>which parent gets used for the sitemap trick?
01:15:48<pokechu22>I think that specific backlog is still on separate storage and hasn't been getting uploaded by the normal means, but I'm not sure of the details
01:16:22<pokechu22>I'm not sure which parent gets used (it might be the sitemap itself on transfer.archivete.am), but if you have e.g. https://example.com in the list and example.com URLs in the sitemap it's fine
01:16:59<pokechu22>things would break if you pointed to an existing sitemap on the site that's in a subdirectory (which apparently isn't supposed to cover things not in that subdirectory according to sitemaps.org, but that doesn't stop sites from doing it anyways...)
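[The sitemap trick, as described, amounts to wrapping a flat URL list in sitemaps.org-style XML (hosted e.g. on transfer.archivete.am) so the crawler fetches every entry. A minimal generator sketch:]

```python
from xml.sax.saxutils import escape

def make_sitemap(urls):
    """Wrap a flat URL list in sitemaps.org-style XML."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>\n")
```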
01:17:06<aeg>splits.io has a code repository on github also. how would that get archived?
01:17:18<pokechu22>We can do github stuff in #gitgud
01:18:18notarobot1 quits [Quit: The Lounge - https://thelounge.chat]
01:18:57notarobot1 joins
01:30:17<h2ibot>PaulWise edited ArchiveBot (+474, add more archiving suggestions): https://wiki.archiveteam.org/?diff=55077&oldid=54522
01:32:17<h2ibot>PaulWise edited Software Heritage (+50, add bulk repos feature): https://wiki.archiveteam.org/?diff=55078&oldid=50949
01:34:12<pabs>pokechu22: the tips section might be a place to put the sitemap trick https://wiki.archiveteam.org/index.php/ArchiveBot#Usage_tips
01:34:58<pabs>aeg: once you have a list of URLs, put it in a descriptively named file like splits.io-all-user-api-urls.txt and upload it to https://transfer.archivete.am/
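[transfer.archivete.am behaves like a transfer.sh-style host, so the upload is a plain HTTP PUT to the desired filename; the exact interface is an assumption, so verify against the site itself. Building the request in Python:]

```python
import urllib.request

def build_upload_request(data: bytes, filename: str) -> urllib.request.Request:
    """PUT a file to a transfer.sh-style host (assumed interface)."""
    return urllib.request.Request(
        f"https://transfer.archivete.am/{filename}", data=data, method="PUT")

# usage (network call left commented; the response body is expected to
# contain the hosted URL on transfer.sh-style hosts):
# with open("splits.io-all-user-api-urls.txt", "rb") as f:
#     req = build_upload_request(f.read(), "splits.io-all-user-api-urls.txt")
# print(urllib.request.urlopen(req).read().decode())
```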
01:36:26egallager quits [Client Quit]
01:39:58etnguyen03 quits [Client Quit]
01:42:35kuroger quits [Quit: ZNC 1.9.1 - https://znc.in]
01:46:03chains joins
01:49:02kuroger (kuroger) joins
01:52:40<pabs>that_lurker: on the channels from this page that you operate, could you add the page to /topic (#hetzner-firehose for eg) https://wiki.archiveteam.org/index.php/Archiveteam:IRC/Relay
02:06:04BlueMaxima quits [Read error: Connection reset by peer]
02:09:22BennyOtt_ joins
02:09:22chains quits [Client Quit]
02:09:24BennyOtt quits [Ping timeout: 250 seconds]
02:10:23BennyOtt_ is now known as BennyOtt
02:11:02<aeg>splits.io also has a twitter (https://twitter.com/splitsio). can that get archived?
02:11:02<eggdrop>nitter: https://nitter.net/splitsio
02:14:19etnguyen03 (etnguyen03) joins
02:26:08notarobot1 quits [Client Quit]
02:27:25notarobot1 joins
02:33:20egallager joins
02:41:04<nicolas17>I think we haven't been able to archive twitter since elon took over
02:41:14<nicolas17>and added the login requirement
02:42:17notarobot1 quits [Client Quit]
02:43:31notarobot1 joins
02:52:44notarobot1 quits [Client Quit]
02:54:01notarobot1 joins
02:57:12etnguyen03 quits [Remote host closed the connection]
03:06:26lennier1 quits [Ping timeout: 260 seconds]
03:13:25lennier1 (lennier1) joins
03:22:36<h2ibot>Vitzli edited Radio Free Europe (+310, /* Estimated size */ Add missing @rferlonline): https://wiki.archiveteam.org/?diff=55079&oldid=55076
03:26:40lennier2 joins
03:30:21lennier1 quits [Ping timeout: 260 seconds]
03:33:48YooperKirks quits [Quit: Ooops, wrong browser tab.]
03:37:04<DigitalDragons>No kind of wireguard/etc is acceptable, right? Or is it more specific to public VPN services
03:56:11<TheTechRobo>DigitalDragons: It wasn't an official stance, but
03:56:12<TheTechRobo>[#warrior] <@J.AA> TheTechRobo: My view: if you control both ends, and if all worker traffic to the internet (including DNS) exits at the same place, and if that other place's internet connection is clean, and if that other place is exclusively used by you, it should be fine (unless I forgot about another condition).
03:56:18<TheTechRobo>DigitalDragons: It wasn't an official stance, but
03:56:18<TheTechRobo>[#warrior] <@J.AA> But it's easy to get the configuration wrong, so I still wouldn't recommend it.
03:56:30<TheTechRobo>...thank you, The Lounge
03:56:39<DigitalDragons>The Lounge--
03:56:41<eggdrop>[karma] 'The Lounge' now has -65 karma!
03:57:08<DigitalDragons>thanks :)
03:59:24<pabs>aeg: add to https://pad.notkiska.pw/p/archivebot-twitter
03:59:40<pabs>(mentioned on https://wiki.archiveteam.org/index.php/Twitter)
04:00:04<pabs>nicolas17: ^
04:00:58dendory quits [Quit: The Lounge - https://thelounge.chat]
04:01:23dendory (dendory) joins
04:01:25egallager quits [Client Quit]
04:04:40StarletCharlotte joins
04:06:38<StarletCharlotte>Hey, are there any programs which can extract files from .WARC files similarly to 7z?
04:06:48<StarletCharlotte>Preferably ones which support Windows?
04:08:02<StarletCharlotte>A friend of mine is trying to open a bunch of .warc.gz files from Archive Team, and as it currently stands they look close to unusable for the average person, short of slowly pulling one file at a time through replayweb.page (if that even decides to work).
04:08:41<StarletCharlotte>Which it usually doesn't
04:10:07<StarletCharlotte>I'm seeing a lot of tools to put things into WARCs and convert to WARC, but not a lot to actually get things out of WARCs efficiently.
04:13:18<StarletCharlotte>Preferably with a GUI apparently
04:13:40<nicolas17>hm can't the 7-Zip app literally open warcs?
04:13:56<StarletCharlotte>Not that I've heard. Where'd you hear that?
04:15:26<aeg>pabs: i added it to the list. but is anyone actually grabbing those? the last date i see for anything done is 2024-06
04:16:24<StarletCharlotte>@nicolas17: Are you talking about this? https://www.tc4shell.com/en/7zip/edecoder/
04:17:16<pabs>aeg: not at the moment, we don't have a way to do it. Barto was thinking about setting up an archiving-only Nitter instance, but IIRC it would require registering lots of accounts, so a lot of ongoing work
04:17:18<nicolas17>some people dealing with the "teraleak" TestFlight warcs said 7zip worked but maybe they had some plugin installed; I never tried it myself
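[For reference, the WARC format itself is simple enough that a stdlib-only reader can be sketched. This handles only well-formed, uncompressed records (wrap the stream in `gzip.open` for `.warc.gz`) and skips all the edge cases real tools like warcio and replayweb.page handle, so treat it as an illustration rather than a recommended extractor:]

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if not line.strip():
            continue                    # blank separators between records
        assert line.startswith(b"WARC/"), "expected a WARC version line"
        headers = {}
        for hline in iter(stream.readline, b"\r\n"):  # headers end at blank line
            key, _, val = hline.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = val.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload
```

[Note that a `response` record's payload is the raw HTTP message, so the file body is everything after the first blank line inside the payload.]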
04:31:06StarletCharlotte quits [Ping timeout: 250 seconds]
04:35:10<aeg>may/should i create a page on archiveteam wiki (or somewhere) to document archival of splits.io?
04:36:05StarletCharlotte joins
04:36:34<Ryz>Not sure if it's been mentioned, but Twitter support was removed from https://socialblade.com/ on 2025 Mar 14 - https://twitter.com/SocialBlade/status/1900589770671071518
04:36:34<eggdrop>nitter: https://nitter.net/SocialBlade/status/1900589770671071518
04:37:01Webuser603791 quits [Quit: Ooops, wrong browser tab.]
04:42:07Shevrolet joins
04:42:11chains joins
04:42:51Shevrolet quits [Client Quit]
04:50:59<StarletCharlotte>@nicolas17: I see... I've been wondering, can those WARCs even be found anywhere anymore? I don't have much from the "Teraleak" (hate that name, sensationalist garbage) anymore.
04:51:22<StarletCharlotte>I at least saved what was relevant for Omniarchive (Minecraft archival group), but that's about it.
04:51:31<nicolas17>I saved the torrents https://data.nicolas17.xyz/testflight-torrents/
04:51:36<nicolas17>idk if anyone is still seeding them
04:51:38<nicolas17>haven't checked in a while
04:52:11<StarletCharlotte>I hope so.
04:52:52<h2ibot>Vitzli edited Voice of America (+1716, /* Youtube channels */ Add estimated sizes for…): https://wiki.archiveteam.org/?diff=55080&oldid=54985
04:53:01Webuser724271 joins
04:54:44<StarletCharlotte>Seeds "0 (2)" and peers "0 (1)", whatever that means
05:02:10chains quits [Client Quit]
05:11:46sparky14920 (sparky1492) joins
05:15:18sparky1492 quits [Ping timeout: 250 seconds]
05:15:19sparky14920 is now known as sparky1492
05:33:40flotwig quits [Quit: ZNC - http://znc.in]
05:37:18flotwig joins
06:00:49<that_lurker>pabs: Sure. Adding and info page about those relay channels has been on my todo :-)
06:02:06<that_lurker>Some of the channels still have eggdrop as op so I cannot change the titles in them
06:04:19<that_lurker>s/Some/Most
06:08:38<that_lurker>s/titles/topics
06:09:05Island quits [Read error: Connection reset by peer]
06:14:02sec^nd quits [Remote host closed the connection]
06:14:16sec^nd (second) joins
06:19:15Wohlstand (Wohlstand) joins
06:44:31egallager joins
06:52:32PredatorIWD25 quits [Read error: Connection reset by peer]
07:21:13ahm2587 quits [Quit: The Lounge - https://thelounge.chat]
07:21:33ahm2587 joins
07:34:11StarletCharlotte quits [Ping timeout: 260 seconds]
07:48:49StarletCharlotte joins
08:14:01PredatorIWD25 joins
08:17:56BearFortress quits [Ping timeout: 260 seconds]
08:21:01Grzesiek11_ joins
08:21:01Grzesiek11 quits [Read error: Connection reset by peer]
08:25:09HP_Archivist quits [Read error: Connection reset by peer]
08:25:31HP_Archivist (HP_Archivist) joins
08:40:56egallager quits [Client Quit]
08:47:07BearFortress joins
10:22:42BornOn420 quits [Remote host closed the connection]
10:23:16BornOn420 (BornOn420) joins
11:00:01Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat]
11:02:48Bleo18260072271962345 joins
11:28:47nine quits [Quit: See ya!]
11:29:00nine joins
11:29:00nine quits [Changing host]
11:29:00nine (nine) joins
11:30:58th3z0l4_ quits [Read error: Connection reset by peer]
11:32:07th3z0l4 joins
11:33:43SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
11:34:13SkilledAlpaca418962 joins
11:56:46FiTheArchiver joins
12:18:07vitzli (vitzli) joins
12:24:17<h2ibot>Vitzli edited Voice of America (+147, Add video links repository URL): https://wiki.archiveteam.org/?diff=55081&oldid=55080
12:25:17<h2ibot>Vitzli edited Radio Free Europe (+114, Add video links repository URL): https://wiki.archiveteam.org/?diff=55082&oldid=55079
12:26:18<h2ibot>Vitzli edited Radio Free Asia (+147, Add video links repository URL): https://wiki.archiveteam.org/?diff=55083&oldid=55065
12:29:03vitzli quits [Client Quit]
12:32:53PredatorIWD25 quits [Read error: Connection reset by peer]
12:35:05PAARCLiCKS quits [Quit: Ping timeout (120 seconds)]
12:35:06Miori quits [Quit: Ping timeout (120 seconds)]
12:35:53PredatorIWD25 joins
12:36:44SootBector quits [Remote host closed the connection]
12:37:06SootBector (SootBector) joins
12:37:24Miori joins
12:37:46PAARCLiCKS (s4n1ty) joins
12:47:18Wohlstand quits [Quit: Wohlstand]
13:06:04beastbg8 (beastbg8) joins
13:18:21StarletCharlotte quits [Ping timeout: 260 seconds]
13:48:39vitzli (vitzli) joins
13:51:33VoynichCR (VoynichCR) joins
14:00:30StarletCharlotte joins
14:02:38<h2ibot>Dango360 edited Discourse/uncategorized (+41, Added dcs.community): https://wiki.archiveteam.org/?diff=55084&oldid=54449
14:05:01StarletCharlotte quits [Ping timeout: 260 seconds]
14:38:23<PredatorIWD25>Given the new https://blog.cloudflare.com/ai-labyrinth/, it might be a good time to now revisit the possibilities of getting AT whitelisted by Cloudflare?
14:41:05<VoynichCR>does the ai labyrinth have good content?
14:44:45<h2ibot>VoynichCr edited WikiTeam (-1, logo): https://wiki.archiveteam.org/?diff=55085&oldid=55075
14:47:12SootBector quits [Client Quit]
14:48:46<h2ibot>VoynichCr uploaded File:Wikiteam3.png (https://github.com/saveweb/wikiteam3/): https://wiki.archiveteam.org/?title=File%3AWikiteam3.png
14:48:47<h2ibot>VoynichCr edited WikiTeam (+45, /* WikiTeam3 */ image): https://wiki.archiveteam.org/?diff=55087&oldid=55085
14:51:46<h2ibot>VoynichCr uploaded File:Wikiteam 2025.png (https://github.com/WikiTeam/wikiteam): https://wiki.archiveteam.org/?title=File%3AWikiteam%202025.png
14:52:43SootBector (SootBector) joins
14:52:46<h2ibot>VoynichCr edited WikiTeam (+28, image): https://wiki.archiveteam.org/?diff=55089&oldid=55087
14:58:47<h2ibot>VoynichCr uploaded File:Wikibot.png (https://wikibot.digitaldragon.dev/): https://wiki.archiveteam.org/?title=File%3AWikibot.png
14:59:47<h2ibot>VoynichCr edited Wikibot (+2, image): https://wiki.archiveteam.org/?diff=55091&oldid=54992
15:01:48<h2ibot>VoynichCr edited Software Heritage (+58): https://wiki.archiveteam.org/?diff=55092&oldid=55078
15:03:48<h2ibot>VoynichCr uploaded File:Software Heritage.png (https://www.softwareheritage.org/): https://wiki.archiveteam.org/?title=File%3ASoftware%20Heritage.png
15:04:38ThreeHM quits [Ping timeout: 250 seconds]
15:04:48<h2ibot>VoynichCr edited Software Heritage (+326, infobox): https://wiki.archiveteam.org/?diff=55094&oldid=55092
15:10:31vitzli quits [Client Quit]
15:26:25Ashurbinary joins
15:31:14sparky14920 (sparky1492) joins
15:34:51sparky1492 quits [Ping timeout: 260 seconds]
15:34:51sparky14920 is now known as sparky1492
15:40:55ThreeHM (ThreeHeadedMonkey) joins
15:49:43sparky14922 (sparky1492) joins
15:50:04Webuser299214 quits [Quit: Ooops, wrong browser tab.]
15:53:10sparky1492 quits [Ping timeout: 250 seconds]
15:53:11sparky14922 is now known as sparky1492
16:29:42us3rrr joins
16:33:02onetruth quits [Ping timeout: 250 seconds]
16:37:15szczot3k quits [Remote host closed the connection]
16:37:24szczot3k (szczot3k) joins
16:43:15szczot3k quits [Remote host closed the connection]
16:43:58szczot3k (szczot3k) joins
16:51:33gust joins
16:55:33egallager joins
16:56:38sparky14926 (sparky1492) joins
16:59:29StarletCharlotte joins
16:59:54sparky1492 quits [Ping timeout: 250 seconds]
16:59:55sparky14926 is now known as sparky1492
17:05:16StarletCharlotte quits [Ping timeout: 260 seconds]
17:05:26onetruth joins
17:09:21us3rrr quits [Ping timeout: 260 seconds]
17:21:02<nstrom|m>https://ip4.me shutdown notice, owner passed away
17:25:29StarletCharlotte joins
17:36:02StarletCharlotte quits [Remote host closed the connection]
17:36:20StarletCharlotte joins
17:41:31Island joins
18:20:01lennier2_ joins
18:22:51lennier2 quits [Ping timeout: 260 seconds]
18:25:21Hackerpcs quits [Quit: Hackerpcs]
18:27:00StarletCharlotte quits [Ping timeout: 250 seconds]
18:27:29<@JAA>I believe we've covered Kevin Loch's things already, yeah.
18:30:39Hackerpcs (Hackerpcs) joins
18:30:50StarletCharlotte joins
18:51:41<xkey>>
18:51:46<xkey>> Internet Archive Europe is a project by the Dutch non-profit research library Stichting Internet Archive. https://www.internetarchive.eu/
18:51:50<xkey>was this already discussed?
19:14:14Juest quits [Ping timeout: 250 seconds]
19:17:03itrooz (itrooz) joins
19:18:18<itrooz>Hey! I was trying to download https://archive.org/details/archiveteam_github_20180704020939 but the download seems to be restricted (as are other github archives). Does anyone know why?
19:19:13<nicolas17>arkiver: ^ is there a public answer for why WARCs are restricted?
19:19:41<nicolas17>I don't want to keep giving speculation and misremembering stuff that was said in other channels months ago :p
19:19:52<nicolas17>public/official answer*
19:19:52FiTheArchiver quits [Ping timeout: 250 seconds]
19:24:05<@arkiver>multiple reasons (usually one is more at play than the other in different cases). generally, data is accessible through the Wayback Machine for regular viewing. there may be problems with some of the data, which can then be handled by blocking viewing through the Wayback Machine. if the original WARCs are available for public download, we would either have to take the specific record out of the WARC, or make the entire WARC unavailable.
19:24:05<@arkiver>just letting the Wayback Machine (and their decision teams) handle decision on what is public and not works easier
19:25:03<@arkiver>second is the LLM training stuff we see lately. Archive Team is a huge juicy pile of data that big AI companies could make lots of money from. but that is not what web archives are for. if web archives are used commercially in that way, support will go away very fast and this will significantly hurt our ability to archive
19:26:13<@arkiver>in short - we just "archive stuff", and the responsibility for what to make public, how, and when, is given to someone else... which gives us quite some freedom, and a lot less headaches around data access, what can and cannot be public, exclusion requests, etc.
19:26:43<@arkiver>of course, we put some trust in the other party to do the right thing when it comes to these decisions.
19:27:21gust quits [Client Quit]
19:29:39<TheTechRobo>yeah, I suspect providers will be happier to allow us to scrape if the WARCs aren't public, exactly because of AI.
19:39:08Juest (Juest) joins
19:49:30Wohlstand (Wohlstand) joins
19:52:27VoynichCR quits [Quit: Ooops, wrong browser tab.]
20:04:59gust joins
20:17:46lennier2_ quits [Ping timeout: 260 seconds]
20:18:52lennier2_ joins
20:33:54<itrooz>Oh, I see. I was considering creating a project to index/view GitHub issues/discussions of projects that made them private (that sometimes happens, and important knowledge can be lost in that case), and that's why I was looking around for WARCs
20:35:45<itrooz>I'll look through the wayback machine apis, I can probably crawl through the pages manually if the rate limit allows it
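[The CDX API is the usual entry point for that: it enumerates captures under a URL prefix, after which each capture can be fetched from web.archive.org. A sketch of building and parsing such a query (the endpoint and parameters follow the public CDX server documentation; the repo path in the test is a made-up example):]

```python
import urllib.parse

CDX = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url_prefix, limit=1000):
    """URL listing captures whose URL starts with url_prefix, as JSON."""
    qs = urllib.parse.urlencode({
        "url": url_prefix,
        "matchType": "prefix",
        "output": "json",
        "limit": limit,
    })
    return f"{CDX}?{qs}"

def parse_cdx_rows(rows):
    """The JSON output's first row is the field names; zip it onto the rest."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```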
20:53:00sparky14929 (sparky1492) joins
20:56:30sparky1492 quits [Ping timeout: 250 seconds]
20:56:31sparky14929 is now known as sparky1492
21:03:47<aeg>so if i provide the url list for archivebot to scrape the splits.io database... does this mean that the resulting warc won't be available to the public?
21:12:18<nicolas17>I think archivebot warcs are public
21:31:12Ashurbinary quits [Remote host closed the connection]
21:32:59StarletCharlotte quits [Remote host closed the connection]
21:39:05sparky14924 (sparky1492) joins
21:39:50sparky14925 (sparky1492) joins
21:42:26sparky1492 quits [Ping timeout: 250 seconds]
21:42:27sparky14925 is now known as sparky1492
21:43:44sparky14924 quits [Ping timeout: 250 seconds]
21:50:33etnguyen03 (etnguyen03) joins
21:57:35BlueMaxima joins
21:59:31<pokechu22>Yeah, archivebot warcs are public
22:05:41Wohlstand quits [Ping timeout: 260 seconds]
22:23:59etnguyen03 quits [Client Quit]
22:31:25klaffty joins
22:35:22BlueMaxima quits [Client Quit]
22:55:25chains joins
22:58:09matoro joins
23:10:43Ketchup902 quits [Remote host closed the connection]
23:10:55Ketchup901 (Ketchup901) joins
23:17:38flotwig quits [Quit: ZNC - http://znc.in]
23:18:42flotwig joins
23:28:13chains quits [Client Quit]