00:00:45 | | Matthww quits [Client Quit] |
00:03:21 | | Matthww joins |
00:03:56 | | Matthww quits [Remote host closed the connection] |
00:13:21 | | etnguyen03 (etnguyen03) joins |
00:18:03 | | loug8318142 quits [Quit: The Lounge - https://thelounge.chat] |
00:26:53 | | etnguyen03 quits [Client Quit] |
00:38:10 | | HP_Archivist quits [Quit: Leaving] |
00:43:50 | | beardicus (beardicus) joins |
01:01:33 | | etnguyen03 (etnguyen03) joins |
01:15:08 | | beardicus quits [Ping timeout: 260 seconds] |
01:23:19 | | beardicus (beardicus) joins |
01:33:19 | | hackbug quits [Remote host closed the connection] |
01:34:12 | | hackbug (hackbug) joins |
01:48:39 | | nine quits [Quit: See ya!] |
01:48:52 | | nine joins |
01:48:52 | | nine is now authenticated as nine |
01:48:52 | | nine quits [Changing host] |
01:48:52 | | nine (nine) joins |
02:07:03 | | beardicus quits [Ping timeout: 260 seconds] |
02:22:56 | | beardicus (beardicus) joins |
02:30:48 | | sec^nd quits [Remote host closed the connection] |
02:30:52 | | SootBector quits [Remote host closed the connection] |
02:30:53 | | HP_Archivist (HP_Archivist) joins |
02:31:08 | | sec^nd (second) joins |
02:31:11 | | SootBector (SootBector) joins |
02:31:59 | | StarletCharlotte quits [Read error: Connection reset by peer] |
02:33:01 | <pokechu22> | Do we have any good way of recording a bunch of POST requests with known data and URLs into a WARC? I generated a list of those for https://sleepnomoreauction.com/ yesterday, but I'm only able to save the images with archivebot since everything else is POST (note that they also require an origin header, and possibly some other headers) |
02:42:08 | | etnguyen03 quits [Client Quit] |
02:49:44 | <TheTechRobo> | Should be theoretically easy with qwarc from my previous research, but /me doesn't currently have time to write a spec file |
02:50:43 | | pabs quits [Read error: Connection reset by peer] |
02:52:05 | <@OrIdow6> | Could also be done with wget-at |
02:52:17 | | graham9 joins |
02:52:25 | | pabs (pabs) joins |
02:53:06 | | etnguyen03 (etnguyen03) joins |
02:58:10 | <@JAA> | Yeah, easy with qwarc. |
02:59:18 | <pabs> | -feed/#hackernews- Society for Technical Communication to permanently close its doors https://www.stc.org/ https://news.ycombinator.com/item?id=42867324 |
03:01:47 | <h2ibot> | OrIdow6 edited Niconico (+432, /* Nico Nico Seiga */ Nico Nico Shunga has done…): https://wiki.archiveteam.org/?diff=54287&oldid=54264 |
03:03:16 | <TheTechRobo> | AttributeError: module 'asyncio' has no attribute 'coroutine'. Did you mean: 'coroutines'? |
03:03:16 | <TheTechRobo> | What version of Python should I use for qwarc? |
03:03:43 | <TheTechRobo> | (I'm on the latest commit in the 0.2 branch) |
03:04:38 | <@JAA> | Hmm yeah, that was removed in 3.11. |
03:04:58 | <@JAA> | FWIW, that isn't used in qwarc's code, so it'll be from a dependency. |
03:05:06 | <@JAA> | Probably the ancient aiohttp. |
03:05:09 | <TheTechRobo> | Ah, yeah, aiohttp |
03:06:26 | <@JAA> | The aiohttp code is ugly because it doesn't expose the raw HTTP traffic, so it hard-depends on that ancient version. |
03:06:56 | <@JAA> | I usually run my things under 3.6, but I know 3.8 works fine. Not sure about newer ones. |
03:06:56 | <TheTechRobo> | On 3.9 I get |
03:06:56 | <TheTechRobo> | class CeilTimeout(Timeout): |
03:06:57 | <TheTechRobo> | TypeError: function() argument 'code' must be code, not str |
03:07:01 | <TheTechRobo> | Also in aiohttp. |
03:07:12 | <@JAA> | Not in async-timeout? |
03:07:25 | <TheTechRobo> | /home/thetechrobo/qwarc/venv/lib/python3.9/site-packages/aiohttp/helpers.py |
03:07:36 | <@JAA> | Hmm yeah, I guess that's where the error happens. |
03:07:40 | <@JAA> | You need async-timeout==3.0.1. |
03:07:49 | <h2ibot> | OrIdow6 edited Web Roasting (+283, Explain what it is a bit): https://wiki.archiveteam.org/?diff=54288&oldid=30443 |
03:08:09 | <@JAA> | https://github.com/aio-libs/aiohttp/issues/6320 |
03:08:11 | | pie_ quits [] |
03:09:43 | <nicolas17> | let's build our own library |
03:09:54 | <nicolas17> | with h11, blackjack and hookers |
03:10:02 | <@JAA> | That's the plan, yes. |
03:10:52 | <TheTechRobo> | Can I make pyenv rebuild the sqlite part of python without removing and reinstalling the entire version? Turns out I didn't have the sqlite headers installed when I installed 3.9. |
03:11:36 | <@JAA> | This was never intended to be long-lived. Remember that qwarc in its current form is basically the code I wrote for one specific project years ago, repackaged into something somewhat reusable. |
03:11:56 | <@JAA> | TheTechRobo: As far as I know, no. |
03:12:40 | <@JAA> | qwarc also used to use warcio. I ripped that out in record time when I discovered its intentional data mangling. |
03:12:53 | <@JAA> | So now it's bespoke custom WARC-writing code. |
03:12:54 | <TheTechRobo> | I have wondered, are WARCs made by warcio still in the WBM? |
03:13:06 | <TheTechRobo> | Not just for qwarc, but also for other things |
03:13:34 | <@JAA> | Replacing that is at the top of my qwarc todo list, hence pywarc. |
03:13:54 | <@JAA> | I'm sure there's warcio data in the WBM, yeah. |
03:15:06 | <TheTechRobo> | Are the old qwarc grabs still in the WBM? |
03:15:16 | <@JAA> | I believe so. |
03:17:50 | <h2ibot> | TheTechRobo edited Qwarc (+248, Add dependency information): https://wiki.archiveteam.org/?diff=54289&oldid=53904 |
03:18:36 | <nicolas17> | optane10 is on fire |
03:27:42 | <nicolas17> | optane10 is consistently returning "max connections -1" on youtube, and "connection refused" on blogger |
03:36:53 | <h2ibot> | PaulWise edited ArchiveBot/Ignore (+30, better facebook/instagram ignore): https://wiki.archiveteam.org/?diff=54290&oldid=54271 |
03:37:12 | <@JAA> | That's been mentioned in the project channels, yes. |
03:39:54 | <h2ibot> | PaulWise edited ArchiveBot/Ignore (+193, add wordpress junk): https://wiki.archiveteam.org/?diff=54291&oldid=54290 |
03:39:55 | <h2ibot> | PaulWise edited ArchiveBot/Ignore (+2, ignore trailing / too): https://wiki.archiveteam.org/?diff=54292&oldid=54291 |
03:40:54 | <h2ibot> | PaulWise edited ArchiveBot/Ignore (+6, pinterest ignore other language subdomains too): https://wiki.archiveteam.org/?diff=54293&oldid=54292 |
03:43:04 | <TheTechRobo> | JAA: I assume in the generate(cls) function, whatever I queue has to be a string? |
03:43:39 | <@JAA> | TheTechRobo: Yes |
03:43:56 | <@JAA> | Also, ensure there are no dupes. |
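The two constraints above (items queued from `generate` must be strings, with no duplicates) can be sketched in plain Python. This is not qwarc's actual `generate` API; the function name and setup here are illustrative only:

```python
# Illustrative only: enforces the two invariants discussed above
# (queued items are strings, and no duplicates are queued).
# This is not qwarc's real generate() interface.

def generate_items(raw_urls):
    """Yield each URL exactly once, coerced to str."""
    seen = set()
    for url in raw_urls:
        item = str(url)   # queued items have to be strings
        if item in seen:  # ensure there are no dupes
            continue
        seen.add(item)
        yield item

items = list(generate_items([
    "https://example.org/a",
    "https://example.org/b",
    "https://example.org/a",  # duplicate, dropped
]))
```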
03:58:07 | | graham9 quits [Client Quit] |
03:58:40 | <TheTechRobo> | Does qwarc write to stdout? |
04:01:25 | | pixel (pixel) joins |
04:04:18 | | pixel leaves |
04:04:22 | | pixel (pixel) joins |
04:04:31 | <@JAA> | TheTechRobo: Only if your spec file does. |
04:05:03 | <@JAA> | qwarc on its own, no. |
04:08:42 | | Wohlstand quits [Quit: Wohlstand] |
04:10:06 | | etnguyen03 quits [Remote host closed the connection] |
04:11:12 | <@JAA> | (I do sometimes output things on FD 3 or similar for scripting around qwarc.) |
04:13:54 | | ljcool2006_ quits [Quit: Leaving] |
04:33:55 | <TheTechRobo> | AttributeError: type object '_asyncio.Task' has no attribute 'current_task' on Python 3.9 |
04:34:16 | <@JAA> | Welp |
04:35:35 | <@JAA> | Oh yeah, deprecated in 3.7, removed in 3.9. |
04:35:52 | <@JAA> | Again, not used in qwarc, so I bet it's aiohttp. |
04:36:23 | <TheTechRobo> | Yup |
04:37:25 | <TheTechRobo> | You said pywarc will provide an API for HTTP requests/responses, right? I assume it'll also do weird things to aiohttp? |
04:39:55 | <TheTechRobo> | Er, this might be a stupid question, but is there a way to override qwarc's user agent? You can set one in `headers`, but then you'll have two. |
04:43:07 | <h2ibot> | TheTechRobo edited Qwarc (+52): https://wiki.archiveteam.org/?diff=54294&oldid=54289 |
04:45:03 | <@JAA> | No, pywarc won't use aiohttp. It'll probably h11 with sync and async wrappers. |
04:46:25 | <@JAA> | Heh, there's been a todo comment in the code since 2019 about header overriding. |
04:47:43 | <TheTechRobo> | I'll take that as a no then. :-) |
04:47:44 | <@JAA> | The default headers are stored in the item's `headers` attribute. You can manipulate that from `__init__`, for example (*after* the `super().__init__` call). |
04:47:56 | <TheTechRobo> | spoke too soon |
04:48:11 | <@JAA> | E.g. `def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs); self.headers = []` |
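Expanded slightly, the pattern JAA describes looks like this. `BaseItem` is a hypothetical stand-in for qwarc's item class; the point is that the default `headers` attribute only exists after `super().__init__` runs, so the override must come after that call:

```python
# Sketch of overriding qwarc's default headers from __init__.
# BaseItem is a hypothetical stand-in for the real item class, which
# (per the discussion) sets a default headers attribute, including a
# User-Agent, in its own __init__.

class BaseItem:
    def __init__(self, *args, **kwargs):
        self.headers = [("User-Agent", "qwarc/0.2")]  # assumed default

class MyItem(BaseItem):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # defaults are set here first
        # Replace (not append to) the defaults, so the request carries
        # one User-Agent header instead of two.
        self.headers = [("User-Agent", "my-archiver/1.0")]

item = MyItem()
```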
05:02:38 | | beardicus quits [Ping timeout: 260 seconds] |
05:08:52 | <TheTechRobo> | I like how I said I didn't have time to write a spec file, then proceeded to spend two hours writing my first one. |
05:08:56 | <TheTechRobo> | Procrastination is fun. lol |
05:11:28 | | beardicus (beardicus) joins |
05:14:16 | <TheTechRobo> | If it's useful to anyone, this pretty much just takes a list of URL + body data + HTTP verb, and requests it all. No retries, but there's a JSON log with the status code to stdout. https://transfer.archivete.am/inline/103umH/yoink.py |
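The linked yoink.py is not reproduced here; the sketch below shows the same shape under stated assumptions: input is an iterable of (method, URL, body) tuples, extra headers (e.g. Origin) can be passed in, there are no retries, and one JSON record per request goes to stdout. Names and the input format are made up for illustration:

```python
# Illustrative sketch in the spirit of yoink.py (not the original
# script): replay a known list of requests, logging one JSON status
# record per request to stdout. No retries, no WARC writing shown.
import json
import urllib.error
import urllib.request

def log_line(method, url, status):
    """Format one JSON log record for a completed request."""
    return json.dumps({"method": method, "url": url, "status": status})

def fetch_all(requests_list, extra_headers=None):
    """requests_list: iterable of (method, url, body_bytes_or_None)."""
    for method, url, body in requests_list:
        req = urllib.request.Request(
            url, data=body, method=method,
            headers=dict(extra_headers or {}),
        )
        try:
            with urllib.request.urlopen(req) as resp:
                status = resp.status
        except urllib.error.HTTPError as e:
            status = e.code  # still log the error status
        print(log_line(method, url, status))
```

For the auction site discussed above, a call might look like `fetch_all([("POST", "https://auctionsoftware.net/mobileapi/fetchLocation", b'{"countryCode": 62}')], extra_headers={"Origin": "https://sleepnomoreauction.com"})` — endpoint and header values taken from the surrounding discussion, not verified here.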
05:21:41 | | BlueMaxima quits [Read error: Connection reset by peer] |
05:29:51 | <pokechu22> | I don't have the ability to upload WARCs that end up in WBM (though I guess for a WARC of POSTs that's not relevant). The data in question is all of the URLs with # in them in https://transfer.archivete.am/inline/TnzuJ/sleepnomoreauction.com_urls_2.txt and the headers from line 50 of https://transfer.archivete.am/twnvK/auction.io_sleepnomoreauction.com_process_2.py
05:29:52 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/twnvK/auction.io_sleepnomoreauction.com_process_2.py |
05:34:08 | <TheTechRobo> | !remindme 8h do thing |
05:34:09 | <eggdrop> | [remind] ok, i'll remind you at 2025-01-30T13:34:08Z |
05:38:17 | | Webuser884331 joins |
05:38:49 | <Webuser884331> | question.... Yahoo Briefcase |
05:39:58 | | beardicus quits [Ping timeout: 260 seconds] |
05:40:13 | <@OrIdow6> | Webuser884331: We didn't get it, sorry |
05:40:33 | <@OrIdow6> | "...but the warning time was roughly 60 days, which is long by Yahoo standards but hardly ideal for a service up for nearly a decade" per https://wiki.archiveteam.org/index.php/Yahoo!_Briefcase |
05:43:15 | <Webuser884331> | what about someone's Hotmail |
05:43:36 | <Webuser884331> | my mum died |
05:43:54 | <Webuser884331> | i want any photos she saved |
05:46:17 | <@JAA> | I think that was before AT even existed, although only barely. |
06:20:18 | | Webuser884331 quits [Client Quit] |
06:23:20 | <@JAA> | Actually, not quite. AT emerged in January 2009, domain registration on 2009-01-06. I thought it was a bit later that year. |
06:54:17 | | Dango360 quits [Read error: Connection reset by peer] |
07:55:09 | | pabs quits [Read error: Connection reset by peer] |
07:55:37 | | pabs (pabs) joins |
08:30:25 | | SootBector quits [Remote host closed the connection] |
08:30:42 | | SootBector (SootBector) joins |
08:40:08 | | Emitewiki joins |
08:40:24 | <Emitewiki> | Anything we can do about this, or is it outside our purview/already done? https://bsky.app/profile/bobpony.com/post/3lgvxot2kos2j |
08:41:04 | <pabs> | "Microsoft will be removing the downloads for old Windows Themes in the future." |
08:41:10 | <pabs> | https://support.microsoft.com/windows/windows-themes-94880287-6046-1d35-6d2f-35dee759701e |
08:42:17 | <pabs> | looks like it will work in AB |
08:42:51 | <Emitewiki> | Sweet. |
08:45:19 | <pabs> | seems to be working, but some of the themes are already 404, including from a browser |
08:45:46 | <Emitewiki> | 💀 |
08:46:19 | <pabs> | it likely can't save the Windows Store stuff, which is behind a weird link ms-windows-store://collection/?collectionid=WindowsThemes |
08:52:05 | <Emitewiki> | Dang. Any way for us to manually do some shenanigans to save it? |
08:54:19 | <pabs> | reading the page again, that part isn't in danger, just the direct links, which are being saved |
08:55:09 | <pabs> | should be on archive.org in a few days |
08:59:47 | <Emitewiki> | Ah, you're right. Cool cool, thanks |
08:59:48 | <Emitewiki> | ! |
09:31:14 | | Island quits [Read error: Connection reset by peer] |
09:56:35 | | scurvy_duck joins |
10:00:32 | <Emitewiki> | Anyone mind sending this through AB? The dev is starting to delist some of their games from stores, and this usually precedes a website shutdown, so I just want to be extra safe. https://www.catsoulstudios.com/ |
10:03:16 | <that_lurker> | sure |
10:03:47 | <Emitewiki> | Thanks. |
10:03:51 | <that_lurker> | is there a news article about that somewhere? |
10:04:25 | <Emitewiki> | It's an announcement on their Steam game notices. |
10:04:34 | <Emitewiki> | So, like, within the Steam interface itself. |
10:04:43 | <that_lurker> | aa ok |
10:06:46 | | Emitewiki quits [Client Quit] |
11:30:15 | | Webuser220882 joins |
11:30:26 | | Webuser220882 quits [Client Quit] |
11:34:13 | | PotatoProton01 joins |
11:51:00 | | PotatoProton01 quits [Client Quit] |
12:00:03 | | Bleo18260072271962345 quits [Quit: The Lounge - https://thelounge.chat] |
12:02:49 | | Bleo18260072271962345 joins |
12:13:29 | | lennier2_ joins |
12:16:03 | | lennier2 quits [Ping timeout: 260 seconds] |
12:16:46 | | icedice (icedice) joins |
12:24:59 | | pie_ (pie_) joins |
12:28:33 | | pie_ quits [Client Quit] |
12:30:14 | | pie_ (pie_) joins |
12:30:21 | | pie_ quits [Client Quit] |
12:30:33 | | pie_ (pie_) joins |
12:35:20 | | SkilledAlpaca418962 quits [Quit: SkilledAlpaca418962] |
12:35:36 | | etnguyen03 (etnguyen03) joins |
12:35:52 | | SkilledAlpaca418962 joins |
13:07:02 | | beardicus (beardicus) joins |
13:17:35 | | iram quits [Quit: ~] |
13:18:05 | | iram joins |
13:26:24 | | Naruyoko5 joins |
13:27:36 | | Naruyoko quits [Ping timeout: 250 seconds] |
13:27:48 | | scurvy_duck quits [Ping timeout: 260 seconds] |
13:34:08 | <eggdrop> | [remind] TheTechRobo: do thing |
14:04:38 | | Wohlstand (Wohlstand) joins |
14:06:49 | <TheTechRobo> | pokechu22: So an example request might be a POST to https://auctionsoftware.net/mobileapi/fetchLocation with the body {"countryCode": 62} ? |
14:07:20 | <TheTechRobo> | Is there any link extraction needed or do they just have to be grabbed? |
14:10:42 | | katocala joins |
14:10:59 | | katocala is now authenticated as katocala |
14:25:20 | | hexa- quits [Quit: WeeChat 4.4.3] |
14:26:27 | | hexa- (hexa-) joins |
14:36:50 | | BornOn420 quits [Remote host closed the connection] |
14:37:23 | | BornOn420 (BornOn420) joins |
14:55:21 | <anarcat> | is this on someone's radar? https://www.reddit.com/r/DataHoarder/comments/1idm9ii/datagov_is_currently_being_scrubbed/ |
14:55:49 | <anarcat> | i'm getting kind of exhausted at the "okay, this fascist government is in, and they're going to destroy the entire digital infrastructure of country foo, let's crawl" |
14:59:23 | <kiska> | Some 2.2k datasets have been removed; what they are, I don't know |
15:00:06 | | kansei quits [Quit: ZNC 1.9.1 - https://znc.in] |
15:01:42 | | kansei (kansei) joins |
15:34:15 | | holbrooke joins |
15:35:46 | | riteo (riteo) joins |
15:46:34 | | earl joins |
16:17:04 | | Wohlstand quits [Remote host closed the connection] |
16:17:36 | | Wohlstand (Wohlstand) joins |
16:23:58 | | katocala quits [Ping timeout: 260 seconds] |
16:24:12 | | katocala joins |
16:25:30 | | midou quits [Remote host closed the connection] |
16:25:39 | | midou joins |
16:39:56 | | loug8318142 joins |
16:40:49 | | Wohlstand quits [Client Quit] |
16:53:53 | | SootBector quits [Remote host closed the connection] |
16:54:17 | | SootBector (SootBector) joins |
17:22:54 | | katocala quits [Ping timeout: 250 seconds] |
17:23:08 | | katocala joins |
17:26:30 | | Wohlstand (Wohlstand) joins |
17:33:18 | | holbrooke quits [Client Quit] |
17:49:07 | <pokechu22> | TheTechRobo: I've already done all of the link extraction (that's what the other lines in the file are); they just need to be grabbed. (But also the additional headers are needed; you should get a JSON response, not an HTML response) |
18:09:05 | | scurvy_duck joins |
18:13:59 | | etnguyen03 quits [Quit: Konversation terminated!] |
18:24:09 | <TheTechRobo> | pokechu22: Any rate limiting? |
18:24:27 | <pokechu22> | I didn't run into any with my script |
18:25:09 | <pokechu22> | (but the script was at concurrency 1 effectively) |
18:28:55 | | etnguyen03 (etnguyen03) joins |
18:32:10 | | holbrooke joins |
18:45:14 | | holbrooke quits [Ping timeout: 250 seconds] |
18:52:43 | | icedice quits [Quit: Leaving] |
18:56:52 | <@JAA> | From #hackint: 14:02:38 < i> They say that https://data.gov/ is getting deleted as we speak, losing 1000 datasets a day. |
19:07:46 | | notarobot1 quits [Ping timeout: 250 seconds] |
19:08:40 | <TheTechRobo> | pokechu22: I think I got all of them downloaded to WARC. I don't have WBM permission either, but as you said, it's all POST, so kind of a moot point. |
19:10:05 | <pokechu22> | Thanks. It's probably worth doing it a second time near when the auctions finish (which I thought was tomorrow, but it looks like they extended it to Feb 2? or maybe I just confused myself) |
19:10:22 | <pokechu22> | !remindme 3d https://sleepnomoreauction.com/ auctions close shortly cc TheTechRobo |
19:10:23 | <eggdrop> | [remind] ok, i'll remind you at 2025-02-02T19:10:22Z |
19:13:17 | <TheTechRobo> | pokechu22: Will the URLs be the same the second time around? |
19:13:35 | <TheTechRobo> | + POST data |
19:13:56 | <pokechu22> | They should. I'll re-run my script and make sure but unless they list new items (which seems unlikely given that they're closing) it shouldn't change |
19:14:38 | <TheTechRobo> | Ack |
19:16:16 | <TheTechRobo> | Waiting for book_op.php to decide I'm not a serial killer... |
19:20:29 | | beardicus quits [Remote host closed the connection] |
19:20:30 | <TheTechRobo> | Up at https://archive.org/details/warc-sleepnomoreauction-post-urls |
19:26:22 | | scurvy_duck quits [Client Quit] |
19:26:22 | | moth_ quits [Read error: Connection reset by peer] |
19:26:49 | | moth_ joins |
21:03:16 | | cascode joins |
21:04:45 | | earl quits [] |
21:09:12 | | moth_ quits [Read error: Connection reset by peer] |
21:13:24 | | etnguyen03 quits [Client Quit] |
22:03:31 | | balrog_ is now known as balrog |
22:07:30 | <h2ibot> | TheTechRobo edited Qwarc (+1780, Add some basic documentation): https://wiki.archiveteam.org/?diff=54295&oldid=54294 |
22:07:31 | | loug8318142 quits [Quit: The Lounge - https://thelounge.chat] |
22:07:56 | | balrog quits [Quit: Bye] |
22:08:14 | | balrog (balrog) joins |
22:14:31 | <h2ibot> | TheTechRobo edited Qwarc (-1): https://wiki.archiveteam.org/?diff=54296&oldid=54295 |
22:42:16 | | graham9 joins |
22:43:27 | | graham9 quits [Client Quit] |
22:45:00 | | graham9 joins |
22:46:31 | | graham9 quits [Client Quit] |
22:47:01 | | graham9 joins |
22:49:48 | | graham9 quits [Client Quit] |
23:00:43 | | SootBector quits [Remote host closed the connection] |
23:00:53 | | SootBector (SootBector) joins |
23:03:12 | | utulien joins |
23:09:07 | | etnguyen03 (etnguyen03) joins |
23:09:52 | <opl> | made url lists for catalog.data.gov. apparently the fluctuating amount of available datasets is kinda normal, as it's been happening in older versions available through the wayback machine. doesn't matter though, as it still yielded a bunch of new urls |
23:10:57 | <opl> | here's a (mostly complete) list of urls linked to through the website's resources. as the catalog doesn't actually store the data by itself, all of this is hosted on other websites https://transfer.archivete.am/11tMVS/catalog.data.gov-found-urls.txt |
23:10:59 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/11tMVS/catalog.data.gov-found-urls.txt |
23:12:47 | <opl> | and then there are the urls on catalog.data.gov, to allow browsing it: https://transfer.archivete.am/76FH3/catalog.data.gov-urls.txt (initial api search + html catalog pages sorted by title + datasets) and harvest metadata which allows finding out where the data got indexed from https://transfer.archivete.am/gQgQy/catalog.data.gov-harvest-urls.txt |
23:12:48 | <eggdrop> | inline (for browser viewing): https://transfer.archivete.am/inline/76FH3/catalog.data.gov-urls.txt https://transfer.archivete.am/inline/gQgQy/catalog.data.gov-harvest-urls.txt |
23:31:22 | <pokechu22> | opl: which do you think is most important to run first? |
23:31:44 | <pokechu22> | ... and, is there any rate-limiting? |
23:32:00 | <opl> | i have no idea :) |
23:32:09 | | Wohlstand quits [Quit: Wohlstand] |
23:32:10 | <opl> | ok, more seriously |
23:32:51 | <opl> | the catalog seems to just be a search engine for externally hosted data. i have no idea what datasets get deleted, nor why |
23:34:05 | <opl> | it could be that the reason the datasets are disappearing is because they disappeared from the websites the dataset in the catalog links to, in which case urls from both lists might be disappearing at the same time |
23:35:47 | <opl> | that is, if i understand how this works correctly. i'm making some assumptions: if you go to a random dataset, scroll down to the "metadata source", click on the "harvested from" link, and then go to the "about" tab, you'll find that it links to a .json file on some external website |
23:35:52 | <pokechu22> | I think I might concatenate all 3 into one big list, shuffle it, and then run that |
23:36:20 | <opl> | and it seems those jsons linked to harvests are the source of truth |
23:38:57 | <opl> | yeah, shuffling it all seems sane enough. the api and frontend catalog urls are sorted by creation date and title respectively, so some stuff might be lost to pagination if the datasets happen to get updated while the urls are in the queue |
23:39:55 | <opl> | but there are links to the individual datasets in there too, so ultimately whatever datasets exist now will still get hit |
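pokechu22's concatenate-and-shuffle step is simple enough to sketch; the filenames and seed below are placeholders, not the actual commands used:

```python
# Sketch of combining the three catalog.data.gov URL lists and
# shuffling them before splitting across ArchiveBot jobs.
import random

def shuffled_union(url_lists, seed=None):
    """Concatenate several URL lists and shuffle the result."""
    combined = [url for urls in url_lists for url in urls]
    random.Random(seed).shuffle(combined)  # seeded for reproducibility
    return combined

# e.g. with the three linked files downloaded locally:
# lists = [open(p).read().splitlines()
#          for p in ("found-urls.txt", "urls.txt", "harvest-urls.txt")]
# shuffled = shuffled_union(lists)
```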
23:41:31 | | etnguyen03 quits [Client Quit] |
23:46:34 | <pokechu22> | I've started 4 archivebot jobs for it (each with different parts of the shuffled list). Hopefully there won't be any issues with that (if there are super large files then the high concurrency I'm using might cause problems but we can deal with that later) |
23:47:53 | <opl> | nice, thanks |
23:51:57 | | Island joins |
23:54:06 | <pokechu22> | looks like things aren't going to be perfect, e.g. https://data.usaid.gov/d/vaeq-cj7j redirects to https://data.usaid.gov/Basic-Education/Nepal-Early-Grade-Reading-Program-EGRP-/vaeq-cj7j/about_data and archivebot will reject that due to the no-parent rule (which is a thing for !ao < list jobs even though that doesn't really make sense), though it probably also won't work due |
23:54:08 | <pokechu22> | to javascript |
23:54:27 | <pokechu22> | but we'll still end up saving some stuff at least |
23:56:11 | <opl> | yeah, a lot of the datasets are just links to websites with more information about them, urls to map tile services, and other stuff like that |
23:56:58 | <opl> | information about most of the datasets' existence doesn't exist in the wayback machine at all though, so at least that'll get preserved |
23:58:13 | | ducky quits [Ping timeout: 260 seconds] |