00:01:10etnguyen03 quits [Client Quit]
00:01:12Carnildo quits [Ping timeout: 265 seconds]
00:01:52etnguyen03 (etnguyen03) joins
00:07:32nertzy_ joins
00:09:49Carnildo joins
00:11:39etnguyen03 quits [Client Quit]
00:12:40<nulldata>!tell BenFranske Looks like there's at least some on IA but none recent. https://archive.org/details/twit-podcasts Was there an item of yours taken down? Probably should discuss in #internetarchive
00:12:40<eggdrop>[tell] ok, I'll tell BenFranske when they join next
00:13:17nertzy_ quits [Ping timeout: 265 seconds]
00:16:11Carnildo quits [Ping timeout: 265 seconds]
00:17:32Carnildo joins
00:27:56SootBector quits [Remote host closed the connection]
00:28:21SootBector (SootBector) joins
00:32:48nertzy_ joins
00:53:23DogsRNice joins
01:08:31<fireonlive>at least nulldata is useful
01:09:25<nulldata>Huh?
01:40:44michaelblob_ (michaelblob) joins
01:44:38michaelblob quits [Ping timeout: 265 seconds]
01:45:03michaelblob (michaelblob) joins
01:47:19michaelblob_ quits [Ping timeout: 255 seconds]
01:53:52tapos joins
01:58:18DogsRNice_ joins
02:01:43DogsRNice quits [Ping timeout: 255 seconds]
02:07:16Unholy23619246 (Unholy2361) joins
02:10:44Unholy2361924 quits [Ping timeout: 265 seconds]
02:10:44Unholy23619246 is now known as Unholy2361924
02:10:53BenFranske joins
02:10:54<eggdrop>[tell] BenFranske: [2024-05-02T00:12:40Z] <nulldata> Looks like there's at least some on IA but none recent. https://archive.org/details/twit-podcasts Was there an item of yours taken down? Probably should discuss in #internetarchive
02:13:59<BenFranske>nulldata Yes, I was working on uploading everything from there and had about 7500 episodes uploaded of the 24500 episodes that they have published. Most of them have been pulled and the rest are probably going to get pulled soon I think. My account also got locked at IA just a bit ago. See my currently still available set at
02:13:59<BenFranske>https://archive.org/details/@benfranske (currently 870 items rather than the 7500+ that were there earlier today)
02:15:25ell (ell) joins
02:24:29Carnildo_again joins
02:24:34Carnildo quits [Read error: Connection reset by peer]
02:26:49grid joins
02:37:31gaz joins
02:44:09DogsRNice_ quits [Read error: Connection reset by peer]
02:51:20Carnildo_again quits [Read error: Connection reset by peer]
02:52:26Carnildo joins
02:57:58Carnildo quits [Read error: Connection reset by peer]
02:58:08Carnildo joins
03:05:52etnguyen03 (etnguyen03) joins
03:11:29Carnildo quits [Read error: Connection reset by peer]
03:13:22Carnildo joins
03:13:22Carnildo quits [Read error: Connection reset by peer]
03:36:48BenFranske quits [Client Quit]
03:55:49Carnildo joins
04:01:21<@hook54321>does anyone know if https://dumps.wikimedia.org/other/shorturls/ is dumped anywhere on a regular basis? it looks like they used to be dumped to IA but haven't been for a few years https://archive.org/details/shorturls-20200907
04:02:27Carnildo quits [Remote host closed the connection]
04:02:39Carnildo joins
04:20:29Carnildo quits [Read error: Connection reset by peer]
04:20:44Carnildo joins
04:23:57nertzy_ quits [Client Quit]
04:24:38Carnildo quits [Read error: Connection reset by peer]
04:24:48Carnildo joins
04:29:12lennier2 joins
04:32:01lennier2_ quits [Ping timeout: 255 seconds]
04:36:43grid quits [Client Quit]
04:38:09Carnildo quits [Read error: Connection reset by peer]
04:38:33Carnildo joins
04:40:04shgaqnyrjp_ (shgaqnyrjp) joins
04:42:10shgaqnyrjp quits [Remote host closed the connection]
04:49:08Carnildo quits [Read error: Connection reset by peer]
04:49:19Carnildo joins
04:53:33shgaqnyrjp_ is now known as shgaqnyrjp
04:57:47Carnildo quits [Read error: Connection reset by peer]
04:58:11Carnildo joins
04:59:43kiryu__ joins
05:00:07Church quits [Quit: WeeChat info:version]
05:02:37kiryu_ quits [Ping timeout: 255 seconds]
05:05:03kiryu_ joins
05:07:09qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
05:08:01kiryu__ quits [Ping timeout: 255 seconds]
05:11:59qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
05:12:27benjinsmi joins
05:12:54etnguyen03 quits [Client Quit]
05:13:35benjins2_ joins
05:14:01Carnildo quits [Remote host closed the connection]
05:14:04Carnildo_again joins
05:15:51benjinsm quits [Ping timeout: 265 seconds]
05:15:51benjins2 quits [Ping timeout: 265 seconds]
05:16:58etnguyen03 (etnguyen03) joins
05:17:24Church (Church) joins
05:31:30Carnildo_again quits [Read error: Connection reset by peer]
05:31:34Carnildo joins
05:34:13qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
05:34:53etnguyen03 quits [Remote host closed the connection]
05:40:10kiryu_ quits [Remote host closed the connection]
05:41:31qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
05:41:40kiryu joins
05:41:40kiryu quits [Changing host]
05:41:40kiryu (kiryu) joins
05:45:17shgaqnyrjp quits [Remote host closed the connection]
05:45:33Carnildo quits [Read error: Connection reset by peer]
05:45:35Carnildo joins
05:45:54shgaqnyrjp (shgaqnyrjp) joins
05:53:01Carnildo quits [Read error: Connection reset by peer]
05:53:18Carnildo joins
06:01:29shgaqnyrjp quits [Remote host closed the connection]
06:02:05shgaqnyrjp (shgaqnyrjp) joins
06:02:44qwertyasdfuiopghjkl quits [Ping timeout: 265 seconds]
06:05:12qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
06:05:43no-n0rth joins
06:07:06<no-n0rth>Hey folks! I was looking at the Blingee archive, I'm looking for a file that has some of the stamp swf - would anyone here be familiar with the project?
06:14:43<pokechu22>Hmm, I don't know too much about blingee, but based on the information on https://wiki.archiveteam.org/index.php/Blingee someone here would probably be able to find it. If you have a URL then it'd be on web.archive.org; if you have something else I think that page has enough information on how to figure out the URL?
06:15:39BlueMaxima quits [Read error: Connection reset by peer]
06:17:22<no-n0rth>Thanks for the link! I followed that to the internet archive backups, but the files are huge lol and so far it seems most of them just have comments and gifs. I suspect the AES key was rotated, but I might try running the scraper tomorrow if I don't find a cdx that has swf files
06:22:19Carnildo quits [Read error: Connection reset by peer]
06:22:27Carnildo joins
06:59:47grid joins
07:06:09Unholy23619246 (Unholy2361) joins
07:07:21Doomaholic quits [Ping timeout: 272 seconds]
07:07:34Doomaholic (Doomaholic) joins
07:09:55Unholy2361924 quits [Ping timeout: 265 seconds]
07:14:20Arcorann_ joins
07:19:26Carnildo quits [Read error: Connection reset by peer]
07:19:31Carnildo joins
07:23:33Carnildo quits [Read error: Connection reset by peer]
07:23:42Carnildo joins
07:31:32Carnildo quits [Read error: Connection reset by peer]
07:31:43Carnildo joins
07:36:15Carnildo_again joins
07:36:17Carnildo quits [Remote host closed the connection]
07:55:50Carnildo_again quits [Read error: Connection reset by peer]
07:56:01Carnildo joins
07:58:59qwertyasdfuiopghjkl quits [Client Quit]
08:05:32shgaqnyrjp_ (shgaqnyrjp) joins
08:07:51shgaqnyrjp quits [Ping timeout: 250 seconds]
08:08:17SootBector quits [Ping timeout: 250 seconds]
08:11:45SootBector (SootBector) joins
08:15:19superkuh quits [Remote host closed the connection]
08:20:57superkuh joins
08:28:28Carnildo quits [Read error: Connection reset by peer]
08:28:33Carnildo joins
08:39:26Carnildo quits [Read error: Connection reset by peer]
08:39:37Carnildo joins
08:39:50<lea>say I want to archive a site that's behind a login wall. I could probably write a scraper for it. can I somehow upload the results to the web archive?
08:40:44<lea>site in question: https://usdb.animux.de/ (hosts synced song texts for karaoke apps)
08:41:52Carnildo quits [Read error: Connection reset by peer]
08:41:55Carnildo joins
08:54:40Carnildo quits [Read error: Connection reset by peer]
08:54:44Carnildo joins
09:00:05Bleo18260072 quits [Client Quit]
09:00:44Carnildo quits [Read error: Connection reset by peer]
09:00:58Carnildo joins
09:01:23Bleo18260072 joins
09:05:47<katia>lea, i think stuff that is behind login never goes to wayback machine
09:05:55<katia>but you/anyone can upload it to IA
09:06:14<katia>https://archive.org/developers/internetarchive/cli.html
09:06:56<lea>katia: is there documentation on the preferred format for uploads?
09:07:30<lea>or should I just dump a zip file with all the current data of the site? what about new content? the site is still alive
09:08:06<katia>you probably want to design your scraper to be incremental then
09:09:40grid quits [Client Quit]
09:11:42<lea>yes
09:12:03<lea>since these are individual files, I guess I could just upload tens of thousands of individual files to the archive?
09:13:24<katia>maybe better for #internetarchive
09:14:27<katia>IA unpacks some .tar and maybe others, packing it/compressing it might make more sense than single files
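[editor's note] The two suggestions above (an incremental scraper, and packing files into a tar rather than uploading them one by one) can be sketched together as below. This is only an illustration under stated assumptions: the scraper is assumed to have already written its files into a local directory, and the function name `pack_new_files`, the JSON state file, and the per-batch tarball are all hypothetical choices, not anything prescribed by IA. Files are tracked by relative path only, so a file whose content changes would not be picked up again.

```python
import json
import tarfile
from pathlib import Path


def pack_new_files(src_dir, state_file, tar_path):
    """Pack files not seen in earlier runs into a new .tar.gz batch.

    Returns the relative paths that were packed (empty list if nothing is new).
    Keep state_file and tar_path outside src_dir so they are not packed too.
    """
    src_dir = Path(src_dir)
    state_file = Path(state_file)

    # Hypothetical state format: a JSON list of relative paths already packed.
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()

    new = sorted(
        p.relative_to(src_dir).as_posix()
        for p in src_dir.rglob("*")
        if p.is_file() and p.relative_to(src_dir).as_posix() not in seen
    )

    if new:
        with tarfile.open(tar_path, "w:gz") as tar:
            for rel in new:
                tar.add(src_dir / rel, arcname=rel)
        # Record the batch only after the archive was written successfully.
        state_file.write_text(json.dumps(sorted(seen | set(new))))
    return new
```

Each run then yields at most one new tarball containing only the files added since the previous run, which could be uploaded as a separate batch.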
09:18:13Doran quits [Ping timeout: 255 seconds]
09:21:15Carnildo quits [Read error: Connection reset by peer]
09:21:18Carnildo joins
09:22:45Doran (Doranwen) joins
09:26:45Doran quits [Remote host closed the connection]
09:36:00<thuban>lea: the best format for archival purposes is warc (you can upload warcs to the internet archive like any other item even though they don't go into the wayback machine).
09:36:03<thuban>i suggest using https://github.com/ArchiveTeam/grab-site/, which outputs warc and which you can configure to use your login cookies
09:38:22<lea>the page needs a JS-initiated HTTP POST to give out the data. I can also initiate it without JS. does the tool support a use case like that?
09:41:39<lea>thanks for the pointer btw
09:43:47Ruthalas59 quits [Ping timeout: 272 seconds]
09:44:06pabs quits [Ping timeout: 265 seconds]
09:45:03<thuban>lea: yes, you can use --wpull-args with wpull's --post options (see https://wpull.readthedocs.io/en/master/options.html) to send POST requests. that said, depending on the details this may become very inconvenient
09:47:03pabs (pabs) joins
09:47:06Ruthalas59 (Ruthalas) joins
09:57:57Doran (Doranwen) joins
10:00:05f_ (funderscore) joins
10:00:56<thuban>(since wpull uses the same post data for _all_ requests, worst-case, you may need to scrape the site once, process the output to determine what urls and post data you need for the txt downloads, and invoke grab-site on each individually in a loop. you can combine the results with eg warcat: https://github.com/chfoo/warcat)
10:01:11<thuban>(might still be quicker than writing your own scraper)
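[editor's note] A minimal sketch of the "invoke grab-site on each individually in a loop" idea, assuming you have already worked out the URL and POST body for every download. The job list, the example URL, and the helper names are hypothetical. `--wpull-args` and wpull's `--post-data` are the options mentioned above; adding `--1` to limit each run to a single URL is an assumption about what fits this use case, so check `grab-site --help` before relying on it.

```python
import shlex
import subprocess


def grab_site_command(url, post_data):
    """Build the argument list for one grab-site run that POSTs post_data to url."""
    # grab-site splits the --wpull-args value with shell-style rules,
    # so the POST body is quoted to survive that split as one argument.
    wpull_args = "--post-data " + shlex.quote(post_data)
    return ["grab-site", "--1", f"--wpull-args={wpull_args}", url]


def run_jobs(jobs):
    """Run grab-site once per (url, post_data) pair; stops at the first failure."""
    for url, post_data in jobs:
        subprocess.run(grab_site_command(url, post_data), check=True)


if __name__ == "__main__":
    # Hypothetical jobs; real URLs and POST bodies come from a first scrape.
    example_jobs = [("https://example.org/download.php", "id=1&mode=txt")]
    for url, data in example_jobs:
        print(grab_site_command(url, data))
```

Each run produces its own WARC, which is why a merge step (for example with warcat, as suggested above) would follow.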
10:14:55Doran quits [Ping timeout: 255 seconds]
10:50:11<Miori>Did you guys see subscene closing down in 24 hours? https://forum.subscene.com/topic/subscene-is-closing-so-sorry
10:56:16<joepie91|m>well shit
10:56:37<katia>buttflare :|
10:57:00Doran (Doranwen) joins
10:57:43<katia>well not on subscene.com, just on forum?
10:58:11<katia>started an archivebot job for subscene.com
11:07:43f_ quits [Client Quit]
11:07:49SootBector quits [Remote host closed the connection]
11:08:18SootBector (SootBector) joins
11:17:32<Miori>https://www.reddit.com/r/DataHoarder/comments/1b5rxc2/subscenecom_full_dump/ and apparently https://subdl.com/ is mirroring data from subscene every hour
11:24:37<katia>nice
11:44:44thalia quits [Quit: Connection closed for inactivity]
11:51:02knecht4 quits [Client Quit]
11:52:03knecht4 joins
12:03:57nertzy_ joins
12:09:27Carnildo quits [Read error: Connection reset by peer]
12:09:35Carnildo joins
12:23:24Carnildo quits [Read error: Connection reset by peer]
12:23:47Carnildo joins
12:42:49jaxon joins
12:44:15jaxon quits [Client Quit]
13:00:02etnguyen03 (etnguyen03) joins
13:14:55Carnildo quits [Read error: Connection reset by peer]
13:15:03Carnildo joins
13:20:59Carnildo quits [Read error: Connection reset by peer]
13:21:41Carnildo joins
13:27:04Arcorann_ quits [Ping timeout: 255 seconds]
13:28:00nertzy_ quits [Client Quit]
13:40:34sonick quits [Client Quit]
13:42:39Carnildo quits [Read error: Connection reset by peer]
13:42:43Carnildo joins
13:52:09Carnildo quits [Read error: Connection reset by peer]
13:52:15Carnildo joins
13:53:17tapos quits [Client Quit]
13:55:03Wohlstand (Wohlstand) joins
14:03:56Carnildo quits [Read error: Connection reset by peer]
14:04:02Carnildo joins
14:09:54Carnildo quits [Read error: Connection reset by peer]
14:10:02Carnildo joins
14:17:31s-crypt quits [Quit: Ping timeout (120 seconds)]
14:17:43s-crypt (s-crypt) joins
14:18:13Carnildo quits [Read error: Connection reset by peer]
14:18:28Carnildo joins
14:23:13Carnildo_again joins
14:23:17Carnildo quits [Remote host closed the connection]
14:26:44Mateon1 quits [Quit: Mateon1]
14:27:25Mateon1 joins
14:28:53Carnildo_again quits [Read error: Connection reset by peer]
14:29:07Carnildo joins
14:31:36knecht4 quits [Client Quit]
14:33:35knecht4 joins
15:16:28f_ (funderscore) joins
15:26:54RealPerson joins
15:28:15qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
15:37:28etnguyen03 quits [Client Quit]
15:38:09etnguyen03 (etnguyen03) joins
15:40:25Carnildo quits [Read error: Connection reset by peer]
15:40:29Carnildo joins
15:47:56etnguyen03 quits [Client Quit]
15:48:37etnguyen03 (etnguyen03) joins
15:53:04Carnildo quits [Read error: Connection reset by peer]
15:53:09Carnildo joins
15:56:28Perk quits [Read error: Connection reset by peer]
15:58:23etnguyen03 quits [Client Quit]
15:59:04etnguyen03 (etnguyen03) joins
16:08:50etnguyen03 quits [Client Quit]
16:10:20Carnildo quits [Read error: Connection reset by peer]
16:10:32Carnildo joins
16:20:43Carnildo quits [Read error: Connection reset by peer]
16:20:59Carnildo joins
16:27:41JaffaCakes118 quits [Ping timeout: 265 seconds]
16:30:01Carnildo quits [Read error: Connection reset by peer]
16:30:21Carnildo joins
16:36:20Carnildo quits [Remote host closed the connection]
16:36:51Carnildo joins
16:39:48<gaz>hey peeps, i'm looking for some advice or tips: i want to download absolutely everything associated with a few domains from the wayback machine (all subdomains, images, js, css, etc etc etc). my initial investigations put what i want to grab at like 30 million urls, and would take like 6 months on one machine. i'm hoping you guys have info that
16:39:49<gaz>could help :)
16:51:59Carnildo quits [Read error: Connection reset by peer]
16:52:12Carnildo joins
16:54:43<that_lurker>gaz: Easiest might be to try and search for the domain in https://archive.fart.website/archivebot/viewer/?q=utu.fi and download the associated .warc.gz file
16:54:59<that_lurker>correction the link is https://archive.fart.website/archivebot/viewe
16:55:34that_lurker wonder how one can send the wrong link twice
16:55:42<gaz>ok i'll have a look
16:55:43<gaz>lol
16:56:00<@JAA>That will only work if it's an ArchiveBot crawl, of course. You wouldn't get snapshots from other sources etc.
16:56:26<@JAA>But yeah, if there is such a crawl, it's probably a good start.
16:56:53<that_lurker>yeah. forgot to mention that too.... The sudden summer heat in Finland is getting to me :-P
16:59:48<Vokun>Ah yes. With a high of just above freezing, i'd be sweating too
17:00:37Carnildo_again joins
17:00:38Carnildo quits [Read error: Connection reset by peer]
17:02:14f_ quits [Remote host closed the connection]
17:03:00f_ (funderscore) joins
17:03:31eightthree quits [Ping timeout: 255 seconds]
17:03:33<Vokun>Actually, sorry. It's too hot where I live too
17:04:18Carnildo_again quits [Remote host closed the connection]
17:04:25Carnildo joins
17:04:52eightthree joins
17:09:01f_ quits [Remote host closed the connection]
17:10:07RealPerson leaves
17:10:32RealPerson joins
17:11:03f_ (funderscore) joins
17:11:42<that_lurker>These are the first days when it's starting to go over 10 C here during the day. Nights are still around 5 C and now days are somewhere close to 20 C
17:14:06Carnildo quits [Read error: Connection reset by peer]
17:14:13Carnildo joins
17:14:37<Larsenv>it's https://archive.fart.website/archivebot/viewer/
17:18:34<Vokun>It goes from about 12-28 here from night to day. I run a fan at night from the window while I sleep cause it doesn't cool down till really late at night, so when I wake up i'm fridged.
17:19:55f_ quits [Ping timeout: 250 seconds]
17:20:55that_lurker watches for the looming gaze of JAA as the conversation has gone offtopic and wonders whether to continue or not :P
17:21:02f_ (funderscore) joins
17:21:21<@JAA>:-)
17:26:02Carnildo quits [Remote host closed the connection]
17:26:09Carnildo joins
17:43:30Island quits [Read error: Connection reset by peer]
18:11:00shgaqnyrjp_ is now known as shgaqnyrjp
18:11:11Carnildo quits [Read error: Connection reset by peer]
18:11:14Carnildo joins
18:18:27Notrealname1234 (Notrealname1234) joins
18:20:10Carnildo quits [Read error: Connection reset by peer]
18:20:34Carnildo joins
18:22:48etnguyen03 (etnguyen03) joins
18:27:44Carnildo quits [Read error: Connection reset by peer]
18:27:49Carnildo joins
18:28:15Notrealname1234 quits [Client Quit]
18:36:53Carnildo_again joins
18:37:13Notrealname1234 (Notrealname1234) joins
18:37:34Carnildo quits [Ping timeout: 255 seconds]
18:48:25sd (sd) joins
18:48:33Carnildo_again quits [Read error: Connection reset by peer]
18:48:38Carnildo joins
18:49:53Notrealname1234 quits [Client Quit]
19:00:55benah joins
19:03:45etnguyen03 quits [Client Quit]
19:04:05benah quits [Client Quit]
19:05:26Carnildo quits [Read error: Connection reset by peer]
19:05:44Carnildo joins
19:08:28Carnildo quits [Read error: Connection reset by peer]
19:08:30Carnildo joins
19:12:52Carnildo_again joins
19:12:52Carnildo quits [Read error: Connection reset by peer]
19:19:31Wohlstand quits [Client Quit]
19:20:02Carnildo_again quits [Read error: Connection reset by peer]
19:22:06Carnildo joins
19:22:07f_ quits [Ping timeout: 250 seconds]
19:27:04Carnildo quits [Ping timeout: 255 seconds]
19:27:30Carnildo joins
19:36:18<Ryz>Heya folks, does anyone wanna help me extract subdomains of http://htmlplanet.com/ ? I found loads of it through https://www.subdomain.center/ and might've found 900 of 'em and I'm planning to run 'em all in AB (can't be HTTPS curiously, it's HTTP only!)
19:36:31<Ryz>I'm...I'm trying to recall if there's an IRC channel dedicated to this <#>;
19:37:08<Ryz>I was initially going to say #webroasting - but that's specifically for ISP hosting websites
19:57:41Wohlstand (Wohlstand) joins
20:08:55kiryu quits [Ping timeout: 255 seconds]
20:11:10jasons quits [Ping timeout: 255 seconds]
20:16:10jasons (jasons) joins
20:26:01jasons quits [Ping timeout: 255 seconds]
20:37:15nertzy_ joins
20:41:11Wohlstand quits [Client Quit]
20:43:22<that_lurker>Ryz: Quick scan found 610. Most likely the same you already got though https://transfer.archivete.am/inline/Gd4Ub/htmlplanetsubdomains.txt
20:44:27sec^nd quits [Ping timeout: 250 seconds]
20:46:24etnguyen03 (etnguyen03) joins
20:50:44sec^nd (second) joins
20:55:36Webuser536 joins
20:56:32no-n0rth quits [Client Quit]
20:57:26<Webuser536>If this is the chat to be talking about this, is there a way of properly using wget to download files from the Wayback Machine?
21:06:45<Ryz>that_lurker, this is from WBM CDX I assume? oo;
21:07:22<that_lurker>Got those by doing a scan with Sublist3r
21:07:45<Ryz>Hello Webuser536, please go to #internetarchive for a better chance of your question being answered
21:08:48<Ryz>that_lurker, go for a WBM CDX please if you can, there might be more subdomains there
21:16:02etnguyen03 quits [Client Quit]
21:16:06<that_lurker>Ryz, Not finding anything at least with https://web.archive.org/cdx/search/cdx?url=*.htmlplanet.com/&matchType=domain
21:18:16<Ryz>Hmm, there has to be more... :C
21:24:46<that_lurker>You could maybe do some hardcore bruteforcing, but that would take a while
21:27:26Island joins
21:34:47Notrealname1234 (Notrealname1234) joins
21:40:43pedantic-darwin quits [Ping timeout: 255 seconds]
21:52:24<pokechu22>try https://web.archive.org/cdx/search/cdx?url=htmlplanet.com&matchType=domain&collapse=urlkey&fl=original&limit=10000&showResumeKey=1&resumeKey=
22:01:33shgaqnyrjp quits [Remote host closed the connection]
22:01:43<that_lurker>oh that found a lot
22:01:59shgaqnyrjp (shgaqnyrjp) joins
22:03:38pedantic-darwin joins
22:09:14<pokechu22>yeah, and you can copy the thing at the bottom and put it into the resumeKey parameter to get more
22:11:08<@JAA>60% of the time, it works every time!
22:13:08<fireonlive>*JAA CDX api flashback horror stories*
22:14:43<Notrealname1234>"JAA" CDX api!
22:15:01<@JAA>Also, little-things/ia-cdx-search is a thing. :-)
22:17:17<@JAA>I guess it might work fine in this case.
22:17:35<@JAA>The resumeKey-based pagination, I mean.
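[editor's note] The resumeKey pagination described above can be sketched roughly as follows. The query parameters are the ones from pokechu22's URL; everything else is an assumption: that each page ends with a blank line followed by the resume key, that the key can be appended to the next request verbatim, and that `fl=original` rows are full URLs. The function names are hypothetical, and as noted in the channel this style of pagination is not always reliable.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def split_page(body):
    """Split one CDX response body into (rows, resume_key or None)."""
    lines = body.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()  # ignore trailing blank lines
    # Assumed layout: rows, then a blank line, then the resume key last.
    if len(lines) >= 2 and not lines[-2].strip():
        return [l for l in lines[:-2] if l.strip()], lines[-1].strip()
    return [l for l in lines if l.strip()], None


def hostnames(rows):
    """Collect the distinct hostnames from rows that are full URLs."""
    hosts = set()
    for row in rows:
        host = urlsplit(row).hostname
        if host:
            hosts.add(host)
    return hosts


def iter_pages(domain, limit=10000):
    """Yield lists of rows, following resume keys until none is returned."""
    params = {
        "url": domain, "matchType": "domain", "collapse": "urlkey",
        "fl": "original", "limit": limit, "showResumeKey": 1,
    }
    resume_key = ""
    while True:
        # The key is appended as-is on the assumption it is already URL-safe.
        url = f"{CDX_ENDPOINT}?{urlencode(params)}&resumeKey={resume_key}"
        with urlopen(url) as resp:
            body = resp.read().decode("utf-8", "replace")
        rows, resume_key = split_page(body)
        yield rows
        if not resume_key:
            break
```

Collecting subdomains would then amount to taking the union of `hostnames(rows)` over every page yielded by `iter_pages("htmlplanet.com")`.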
22:20:59<that_lurker|m>I looked at little-things/ia-cdx-search today and totally forgot about it when I needed it :-)
22:25:33Notrealname1234 quits [Client Quit]
22:26:01Notrealname1234 (Notrealname1234) joins
22:27:30<that_lurker|m>JAA thanks for making those amazing scripts available
22:27:58<Notrealname1234>Wonderful scripts
22:28:08<@JAA>:-)
22:28:13<that_lurker|m>JAA++
22:28:14<eggdrop>[karma] 'JAA' now has 37 karma!
23:03:31Notrealname1234 quits [Client Quit]
23:23:43sec^nd quits [Remote host closed the connection]
23:24:06sec^nd (second) joins
23:26:07lunik1 quits [Client Quit]
23:26:55lunik1 joins
23:39:21Guest77 joins
23:39:48<Guest77>Hello! what is the best way to handle '.warc' files? i have tested a bit the 'grab-site' program but i am clueless on how to treat the .warc file as an 'extractible' file. I would like to see and select which files to extract as one usually does with .zip and other compressed files. zless shows the raw data but it is not the best way
23:50:24Island quits [Read error: Connection reset by peer]