00:00:45Matthww quits [Client Quit]
00:03:21Matthww joins
00:03:56Matthww quits [Remote host closed the connection]
00:13:21etnguyen03 (etnguyen03) joins
00:18:03loug8318142 quits [Quit: The Lounge - https://thelounge.chat]
00:26:53etnguyen03 quits [Client Quit]
00:38:10HP_Archivist quits [Quit: Leaving]
00:43:50beardicus (beardicus) joins
01:01:33etnguyen03 (etnguyen03) joins
01:15:08beardicus quits [Ping timeout: 260 seconds]
01:23:19beardicus (beardicus) joins
01:33:19hackbug quits [Remote host closed the connection]
01:34:12hackbug (hackbug) joins
01:48:39nine quits [Quit: See ya!]
01:48:52nine joins
01:48:52nine quits [Changing host]
01:48:52nine (nine) joins
02:07:03beardicus quits [Ping timeout: 260 seconds]
02:22:56beardicus (beardicus) joins
02:30:48sec^nd quits [Remote host closed the connection]
02:30:52SootBector quits [Remote host closed the connection]
02:30:53HP_Archivist (HP_Archivist) joins
02:31:08sec^nd (second) joins
02:31:11SootBector (SootBector) joins
02:31:59StarletCharlotte quits [Read error: Connection reset by peer]
02:33:01<pokechu22>Do we have any good way of recording a bunch of POST requests with known data and URLs into a WARC? I generated a list of those for https://sleepnomoreauction.com/ yesterday, but I'm only able to save the images with archivebot since everything else is POST (note that they also require an origin header, and possibly some other headers)
02:42:08etnguyen03 quits [Client Quit]
02:49:44<TheTechRobo>Should be theoretically easy with qwarc from my previous research, but /me doesn't currently have time to write a spec file
02:50:43pabs quits [Read error: Connection reset by peer]
02:52:05<@OrIdow6>Could also be done with wget-at
02:52:17graham9 joins
02:52:25pabs (pabs) joins
02:53:06etnguyen03 (etnguyen03) joins
02:58:10<@JAA>Yeah, easy with qwarc.
02:59:18<pabs>-feed/#hackernews- Society for Technical Communication to permanently close its doors https://www.stc.org/ https://news.ycombinator.com/item?id=42867324
03:01:47<h2ibot>OrIdow6 edited Niconico (+432, /* Nico Nico Seiga */ Nico Nico Shunga has done…): https://wiki.archiveteam.org/?diff=54287&oldid=54264
03:03:16<TheTechRobo>AttributeError: module 'asyncio' has no attribute 'coroutine'. Did you mean: 'coroutines'?
03:03:16<TheTechRobo>What version of Python should I use for qwarc?
03:03:43<TheTechRobo>(I'm on the latest commit in the 0.2 branch)
03:04:38<@JAA>Hmm yeah, that was removed in 3.11.
03:04:58<@JAA>FWIW, that isn't used in qwarc's code, so it'll be from a dependency.
03:05:06<@JAA>Probably the ancient aiohttp.
03:05:09<TheTechRobo>Ah, yeah, aiohttp
03:06:26<@JAA>The aiohttp code is ugly because it doesn't expose the raw HTTP traffic, so it hard-depends on that ancient version.
03:06:56<@JAA>I usually run my things under 3.6, but I know 3.8 works fine. Not sure about newer ones.
03:06:56<TheTechRobo>On 3.9 I get
03:06:56<TheTechRobo> class CeilTimeout(Timeout):
03:06:57<TheTechRobo>TypeError: function() argument 'code' must be code, not str
03:07:01<TheTechRobo>Also in aiohttp.
03:07:12<@JAA>Not in async-timeout?
03:07:25<TheTechRobo>/home/thetechrobo/qwarc/venv/lib/python3.9/site-packages/aiohttp/helpers.py
03:07:36<@JAA>Hmm yeah, I guess that's where the error happens.
03:07:40<@JAA>You need async-timeout==3.0.1.
03:07:49<h2ibot>OrIdow6 edited Web Roasting (+283, Explain what it is a bit): https://wiki.archiveteam.org/?diff=54288&oldid=30443
03:08:09<@JAA>https://github.com/aio-libs/aiohttp/issues/6320
03:08:11pie_ quits []
03:09:43<nicolas17>let's build our own library
03:09:54<nicolas17>with h11, blackjack and hookers
03:10:02<@JAA>That's the plan, yes.
03:10:52<TheTechRobo>Can I make pyenv rebuild the sqlite part of python without removing and reinstalling the entire version? Turns out I didn't have the sqlite headers installed when I installed 3.9.
03:11:36<@JAA>This was never intended to be long-lived. Remember that qwarc in its current form is basically the code I wrote for one specific project years ago, repackaged into something somewhat reusable.
03:11:56<@JAA>TheTechRobo: As far as I know, no.
03:12:40<@JAA>qwarc also used to use warcio. I ripped that out in record time when I discovered its intentional data mangling.
03:12:53<@JAA>So now it's bespoke custom WARC-writing code.
03:12:54<TheTechRobo>I have wondered, are WARCs made by warcio still in the WBM?
03:13:06<TheTechRobo>Not just for qwarc, but also for other things
03:13:34<@JAA>Replacing that is at the top of my qwarc todo list, hence pywarc.
03:13:54<@JAA>I'm sure there's warcio data in the WBM, yeah.
03:15:06<TheTechRobo>Are the old qwarc grabs still in the WBM?
03:15:16<@JAA>I believe so.
03:17:50<h2ibot>TheTechRobo edited Qwarc (+248, Add dependency information): https://wiki.archiveteam.org/?diff=54289&oldid=53904
03:18:36<nicolas17>optane10 is on fire
03:27:42<nicolas17>optane10 is consistently returning "max connections -1" on youtube, and "connection refused" on blogger
03:36:53<h2ibot>PaulWise edited ArchiveBot/Ignore (+30, better facebook/instagram ignore): https://wiki.archiveteam.org/?diff=54290&oldid=54271
03:37:12<@JAA>That's been mentioned in the project channels, yes.
03:39:54<h2ibot>PaulWise edited ArchiveBot/Ignore (+193, add wordpress junk): https://wiki.archiveteam.org/?diff=54291&oldid=54290
03:39:55<h2ibot>PaulWise edited ArchiveBot/Ignore (+2, ignore trailing / too): https://wiki.archiveteam.org/?diff=54292&oldid=54291
03:40:54<h2ibot>PaulWise edited ArchiveBot/Ignore (+6, pinterest ignore other language subdomains too): https://wiki.archiveteam.org/?diff=54293&oldid=54292
03:43:04<TheTechRobo>JAA: I assume in the generate(cls) function, whatever I queue has to be a string?
03:43:39<@JAA>TheTechRobo: Yes
03:43:56<@JAA>Also, ensure there are no dupes.
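[Sketch: one way to satisfy both constraints — items as strings, deduplicated — assuming generate may be written as a generator and that urls.txt is the input list (both assumptions, not from qwarc's docs):]
    @classmethod
    def generate(cls):
        seen = set()
        with open('urls.txt') as f:  # hypothetical input file
            for line in f:
                url = line.strip()
                if url and url not in seen:  # strings only, no dupes
                    seen.add(url)
                    yield url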
03:58:07graham9 quits [Client Quit]
03:58:40<TheTechRobo>Does qwarc write to stdout?
04:01:25pixel (pixel) joins
04:04:18pixel leaves
04:04:22pixel (pixel) joins
04:04:31<@JAA>TheTechRobo: Only if your spec file does.
04:05:03<@JAA>qwarc on its own, no.
04:08:42Wohlstand quits [Quit: Wohlstand]
04:10:06etnguyen03 quits [Remote host closed the connection]
04:11:12<@JAA>(I do sometimes output things on FD 3 or similar for scripting around qwarc.)
04:13:54ljcool2006_ quits [Quit: Leaving]
04:33:55<TheTechRobo>AttributeError: type object '_asyncio.Task' has no attribute 'current_task' on Python 3.9
04:34:16<@JAA>Welp
04:35:35<@JAA>Oh yeah, deprecated in 3.7, removed in 3.9.
04:35:52<@JAA>Again, not used in qwarc, so I bet it's aiohttp.
04:36:23<TheTechRobo>Yup
04:37:25<TheTechRobo>You said pywarc will provide an API for HTTP requests/responses, right? I assume it'll also do weird things to aiohttp?
04:39:55<TheTechRobo>Er, this might be a stupid question, but is there a way to override qwarc's user agent? You can set one in `headers`, but then you'll have two.
04:43:07<h2ibot>TheTechRobo edited Qwarc (+52): https://wiki.archiveteam.org/?diff=54294&oldid=54289
04:45:03<@JAA>No, pywarc won't use aiohttp. It'll probably be h11 with sync and async wrappers.
04:46:25<@JAA>Heh, there's been a todo comment in the code since 2019 about header overriding.
04:47:43<TheTechRobo>I'll take that as a no then. :-)
04:47:44<@JAA>The default headers are stored in the item's `headers` attribute. You can manipulate that from `__init__`, for example (*after* the `super().__init__` call).
04:47:56<TheTechRobo>spoke too soon
04:48:11<@JAA>E.g. `def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs); self.headers = []`
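[Sketch: the same idea spread over multiple lines, replacing only the default User-Agent rather than clearing everything — assuming `headers` is a list of (name, value) pairs, which isn't confirmed above:]
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # must come first, per JAA
        # Drop the default User-Agent, then add our own (header layout assumed)
        self.headers = [(k, v) for k, v in self.headers if k.lower() != 'user-agent']
        self.headers.append(('User-Agent', 'my-archive-bot/1.0'))  # hypothetical value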
05:02:38beardicus quits [Ping timeout: 260 seconds]
05:08:52<TheTechRobo>I like how I said I didn't have time to write a spec file, then proceeded to spend two hours writing my first one.
05:08:56<TheTechRobo>Procrastination is fun. lol
05:11:28beardicus (beardicus) joins
05:14:16<TheTechRobo>If it's useful to anyone, this pretty much just takes a list of URL + body data + HTTP verb, and requests it all. No retries, but there's a JSON log with the status code to stdout. https://transfer.archivete.am/inline/103umH/yoink.py
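[Sketch: the rough shape of such a spec file — not the actual yoink.py; qwarc's exact API (Item base class, itemValue, fetch signature and return value) is assumed here rather than checked against the 0.2 branch:]
    import json
    import qwarc

    class PostItem(qwarc.Item):
        itemType = 'post'

        @classmethod
        def generate(cls):
            # One item string per request: JSON-encoded [verb, url, body] (layout is this sketch's choice)
            with open('requests.jsonl') as f:  # hypothetical input file
                yield from {line.strip() for line in f if line.strip()}

        async def process(self):
            verb, url, body = json.loads(self.itemValue)
            # method=/data= kwargs and the aiohttp-style response object are assumptions
            response = await self.fetch(url, method=verb, data=body.encode())
            print(json.dumps({'url': url, 'status': response.status}))  # status log to stdout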
05:21:41BlueMaxima quits [Read error: Connection reset by peer]
05:29:51<pokechu22>I don't have the ability to upload WARCs that end up in WBM (though I guess for a WARC of POSTs that's not relevant). The data in question is all of the URLs with # in them in https://transfer.archivete.am/inline/TnzuJ/sleepnomoreauction.com_urls_2.txt and the headers from line 50 of https://transfer.archivete.am/twnvK/auction.io_sleepnomoreauction.com_process_2.py
05:29:52<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/twnvK/auction.io_sleepnomoreauction.com_process_2.py
05:34:08<TheTechRobo>!remindme 8h do thing
05:34:09<eggdrop>[remind] ok, i'll remind you at 2025-01-30T13:34:08Z
05:38:17Webuser884331 joins
05:38:49<Webuser884331>question.... Yahoo Briefcase
05:39:58beardicus quits [Ping timeout: 260 seconds]
05:40:13<@OrIdow6>Webuser884331: We didn't get it, sorry
05:40:33<@OrIdow6>"...but the warning time was roughly 60 days, which is long by Yahoo standards but hardly ideal for a service up for nearly a decade" per https://wiki.archiveteam.org/index.php/Yahoo!_Briefcase
05:43:15<Webuser884331>what about someone's hotmail
05:43:36<Webuser884331>my mum died
05:43:54<Webuser884331>i want any photos she saved
05:46:17<@JAA>I think that was before AT even existed, although only barely.
06:20:18Webuser884331 quits [Client Quit]
06:23:20<@JAA>Actually, not quite. AT emerged in January 2009, domain registration on 2009-01-06. I thought it was a bit later that year.
06:54:17Dango360 quits [Read error: Connection reset by peer]
07:55:09pabs quits [Read error: Connection reset by peer]
07:55:37pabs (pabs) joins
08:30:25SootBector quits [Remote host closed the connection]
08:30:42SootBector (SootBector) joins
08:33:09`
08:40:08Emitewiki joins
08:40:24<Emitewiki>Anything we can do about this, or is it outside our purview/already done? https://bsky.app/profile/bobpony.com/post/3lgvxot2kos2j
08:41:04<pabs>"Microsoft will be removing the downloads for old Windows Themes in the future."
08:41:10<pabs>https://support.microsoft.com/windows/windows-themes-94880287-6046-1d35-6d2f-35dee759701e
08:42:17<pabs>looks like it will work in AB
08:42:51<Emitewiki>Sweet.
08:45:19<pabs>seems to be working, but some of the themes are already 404, including from a browser
08:45:46<Emitewiki>💀
08:46:19<pabs>it likely can't save the Windows Store stuff, which is behind a weird link ms-windows-store://collection/?collectionid=WindowsThemes
08:52:05<Emitewiki>Dang. Any way for us to manually do some shenanigans to save it?
08:54:19<pabs>reading the page again, that part isn't in danger, just the direct links, which are being saved
08:55:09<pabs>should be on archive.org in a few days
08:59:47<Emitewiki>Ah, you're right. Cool cool, thanks
08:59:48<Emitewiki>!
09:31:14Island quits [Read error: Connection reset by peer]
09:56:35scurvy_duck joins
10:00:32<Emitewiki>Anyone mind sending this through AB? The dev is starting to delist some of their games from stores, and this usually precedes a website shutdown, so I just want to be extra safe. https://www.catsoulstudios.com/
10:03:16<that_lurker>sure
10:03:47<Emitewiki>Thanks.
10:03:51<that_lurker>is there a news article about that somewhere?
10:04:25<Emitewiki>It's an announcement on their Steam game notices.
10:04:34<Emitewiki>So, like, within the Steam interface itself.
10:04:43<that_lurker>aa ok
10:06:46Emitewiki quits [Client Quit]
11:30:15Webuser220882 joins
11:30:26Webuser220882 quits [Client Quit]
11:34:13PotatoProton01 joins
11:51:00PotatoProton01 quits [Client Quit]
12:00:03Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat]
12:02:49Bleo18260072271962345 joins
12:13:29lennier2_ joins
12:16:03lennier2 quits [Ping timeout: 260 seconds]
12:16:46icedice (icedice) joins
12:24:59pie_ (pie_) joins
12:28:33pie_ quits [Client Quit]
12:30:14pie_ (pie_) joins
12:30:21pie_ quits [Client Quit]
12:30:33pie_ (pie_) joins
12:35:20SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962]
12:35:36etnguyen03 (etnguyen03) joins
12:35:52SkilledAlpaca418962 joins
13:07:02beardicus (beardicus) joins
13:17:35iram quits [Quit: ~]
13:18:05iram joins
13:26:24Naruyoko5 joins
13:27:36Naruyoko quits [Ping timeout: 250 seconds]
13:27:48scurvy_duck quits [Ping timeout: 260 seconds]
13:34:08<eggdrop>[remind] TheTechRobo: do thing
14:04:38Wohlstand (Wohlstand) joins
14:06:49<TheTechRobo>pokechu22: So an example request might be a POST to https://auctionsoftware.net/mobileapi/fetchLocation with the body {"countryCode": 62} ?
14:07:20<TheTechRobo>Is there any link extraction needed or do they just have to be grabbed?
14:10:42katocala joins
14:25:20hexa- quits [Quit: WeeChat 4.4.3]
14:26:27hexa- (hexa-) joins
14:36:50BornOn420 quits [Remote host closed the connection]
14:37:23BornOn420 (BornOn420) joins
14:55:21<anarcat>is this on someone's radar? https://www.reddit.com/r/DataHoarder/comments/1idm9ii/datagov_is_currently_being_scrubbed/
14:55:49<anarcat>i'm getting kind of exhausted at the "okay, this fascist government is in, and they're going to destroy the entire digital infrastructure of country foo, let's crawl"
14:59:23<kiska>Some 2.2k datasets have been removed; what they are, I don't know
15:00:06kansei quits [Quit: ZNC 1.9.1 - https://znc.in]
15:01:42kansei (kansei) joins
15:34:15holbrooke joins
15:35:46riteo (riteo) joins
15:46:34earl joins
16:17:04Wohlstand quits [Remote host closed the connection]
16:17:36Wohlstand (Wohlstand) joins
16:23:58katocala quits [Ping timeout: 260 seconds]
16:24:12katocala joins
16:25:30midou quits [Remote host closed the connection]
16:25:39midou joins
16:39:56loug8318142 joins
16:40:49Wohlstand quits [Client Quit]
16:53:53SootBector quits [Remote host closed the connection]
16:54:17SootBector (SootBector) joins
17:22:54katocala quits [Ping timeout: 250 seconds]
17:23:08katocala joins
17:26:30Wohlstand (Wohlstand) joins
17:33:18holbrooke quits [Client Quit]
17:49:07<pokechu22>TheTechRobo: I've already done all of the link extraction (that's what the other lines in the file are); they just need to be grabbed. (But also the additional headers are needed; you should get a JSON response, not an HTML response)
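[Sketch: a one-off sanity check of a single request from the list, using TheTechRobo's example below — the Origin value is an assumption; the real extra headers are on line 50 of pokechu22's process script:]
    import requests

    r = requests.post(
        'https://auctionsoftware.net/mobileapi/fetchLocation',
        json={'countryCode': 62},
        headers={'Origin': 'https://sleepnomoreauction.com'},  # assumed value
    )
    print(r.status_code, r.json())  # expect JSON, not an HTML error page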
18:09:05scurvy_duck joins
18:13:59etnguyen03 quits [Quit: Konversation terminated!]
18:24:09<TheTechRobo>pokechu22: Any rate limiting?
18:24:27<pokechu22>I didn't run into any with my script
18:25:09<pokechu22>(but the script was at concurrency 1 effectively)
18:28:55etnguyen03 (etnguyen03) joins
18:32:10holbrooke joins
18:45:14holbrooke quits [Ping timeout: 250 seconds]
18:52:43icedice quits [Quit: Leaving]
18:56:52<@JAA>From #hackint: 14:02:38 < i> They say that https://data.gov/ is getting deleted as we speak, losing 1000 datasets a day.
19:07:46notarobot1 quits [Ping timeout: 250 seconds]
19:08:40<TheTechRobo>pokechu22: I think I got all of them downloaded to WARC. I don't have WBM permission either, but as you said, it's all POST, so kind of a moot point.
19:10:05<pokechu22>Thanks. It's probably worth doing it a second time near when the auctions finish (which I thought was tomorrow, but it looks like they extended it to Feb 2? or maybe I just confused myself)
19:10:22<pokechu22>!remindme 3d https://sleepnomoreauction.com/ auctions close shortly cc TheTechRobo
19:10:23<eggdrop>[remind] ok, i'll remind you at 2025-02-02T19:10:22Z
19:13:17<TheTechRobo>pokechu22: Will the URLs be the same the second time around?
19:13:35<TheTechRobo>+ POST data
19:13:56<pokechu22>They should. I'll re-run my script and make sure but unless they list new items (which seems unlikely given that they're closing) it shouldn't change
19:14:38<TheTechRobo>Ack
19:16:16<TheTechRobo>Waiting for book_op.php to decide I'm not a serial killer...
19:20:29beardicus quits [Remote host closed the connection]
19:20:30<TheTechRobo>Up at https://archive.org/details/warc-sleepnomoreauction-post-urls
19:26:22scurvy_duck quits [Client Quit]
19:26:22moth_ quits [Read error: Connection reset by peer]
19:26:49moth_ joins
21:03:16cascode joins
21:04:45earl quits []
21:09:12moth_ quits [Read error: Connection reset by peer]
21:13:24etnguyen03 quits [Client Quit]
22:03:31balrog_ is now known as balrog
22:07:30<h2ibot>TheTechRobo edited Qwarc (+1780, Add some basic documentation): https://wiki.archiveteam.org/?diff=54295&oldid=54294
22:07:31loug8318142 quits [Quit: The Lounge - https://thelounge.chat]
22:07:56balrog quits [Quit: Bye]
22:08:14balrog (balrog) joins
22:14:31<h2ibot>TheTechRobo edited Qwarc (-1): https://wiki.archiveteam.org/?diff=54296&oldid=54295
22:42:16graham9 joins
22:43:27graham9 quits [Client Quit]
22:45:00graham9 joins
22:46:31graham9 quits [Client Quit]
22:47:01graham9 joins
22:49:48graham9 quits [Client Quit]
23:00:43SootBector quits [Remote host closed the connection]
23:00:53SootBector (SootBector) joins
23:03:12utulien joins
23:09:07etnguyen03 (etnguyen03) joins
23:09:52<opl>made url lists for catalog.data.gov. apparently the fluctuating amount of available datasets is kinda normal, as it's been happening in older versions available through the wayback machine. doesn't matter though, as it still yielded a bunch of new urls
23:10:57<opl>here's a (mostly complete) list of urls linked to through the website's resources. as the catalog doesn't actually store the data by itself, all of this is hosted on other websites https://transfer.archivete.am/11tMVS/catalog.data.gov-found-urls.txt
23:10:59<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/11tMVS/catalog.data.gov-found-urls.txt
23:12:47<opl>and then there are the urls on catalog.data.gov, to allow browsing it: https://transfer.archivete.am/76FH3/catalog.data.gov-urls.txt (initial api search + html catalog pages sorted by title + datasets) and harvest metadata which allows finding out where the data got indexed from https://transfer.archivete.am/gQgQy/catalog.data.gov-harvest-urls.txt
23:12:48<eggdrop>inline (for browser viewing): https://transfer.archivete.am/inline/76FH3/catalog.data.gov-urls.txt https://transfer.archivete.am/inline/gQgQy/catalog.data.gov-harvest-urls.txt
23:31:22<pokechu22>opl: which do you think is most important to run first?
23:31:44<pokechu22>... and, is there any rate-limiting?
23:32:00<opl>i have no idea :)
23:32:09Wohlstand quits [Quit: Wohlstand]
23:32:10<opl>ok, more seriously
23:32:51<opl>the catalog seems to just be a search engine for externally hosted data. i have no idea what datasets get deleted, nor why
23:34:05<opl>it could be that the reason the datasets are disappearing is because they disappeared from the websites the dataset in the catalog links to, in which case urls from both lists might be disappearing at the same time
23:35:47<opl>that is, if i understand how this works correctly. i'm making some assumptions: if you go to a random dataset, scroll down to the "metadata source", click on the "harvested from" link, and then go to the "about" tab, you'll find that it links to a .json file on some external website
23:35:52<pokechu22>I think I might concatenate all 3 into one big list, shuffle it, and then run that
23:36:20<opl>and it seems those jsons linked to harvests are the source of truth
23:38:57<opl>yeah, shuffling it all seems sane enough. the api and frontend catalog urls are sorted by creation date and title respectively, so some stuff might be lost to pagination if the datasets happen to get updated while the urls are in the queue
23:39:55<opl>but there are links to the individual datasets in there too, so ultimately whatever datasets exist now will still get hit
23:41:31etnguyen03 quits [Client Quit]
23:46:34<pokechu22>I've started 4 archivebot jobs for it (each with different parts of the shuffled list). Hopefully there won't be any issues with that (if there are super large files then the high concurrency I'm using might cause problems but we can deal with that later)
23:47:53<opl>nice, thanks
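[Sketch: the concatenate/shuffle/split step described above — assumes the three lists are saved locally under their transfer.archivete.am filenames, and splits four ways to match the AB jobs:]
    import random

    urls = set()  # a set also dedupes across the three lists
    for name in ('catalog.data.gov-found-urls.txt',
                 'catalog.data.gov-urls.txt',
                 'catalog.data.gov-harvest-urls.txt'):
        with open(name) as f:
            urls.update(line.strip() for line in f if line.strip())

    shuffled = list(urls)
    random.shuffle(shuffled)
    for i in range(4):  # one chunk per archivebot job
        with open(f'chunk-{i}.txt', 'w') as out:
            out.writelines(u + '\n' for u in shuffled[i::4])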
23:51:57Island joins
23:54:06<pokechu22>looks like things aren't going to be perfect, e.g. https://data.usaid.gov/d/vaeq-cj7j redirects to https://data.usaid.gov/Basic-Education/Nepal-Early-Grade-Reading-Program-EGRP-/vaeq-cj7j/about_data and archivebot will reject that due to the no-parent rule (which is a thing for !ao < list jobs even though that doesn't really make sense), though it probably also won't work due to javascript
23:54:27<pokechu22>but we'll still end up saving some stuff at least
23:56:11<opl>yeah, a lot of the datasets are just links to websites with more information about them, urls to map tile services, and other stuff like that
23:56:58<opl>information about most of the datasets' existence doesn't exist in the wayback machine at all though, so at least that'll get preserved
23:58:13ducky quits [Ping timeout: 260 seconds]