00:04:25<mgrandi>But yeah I'll take a look tonight and see if it works for portal maps
00:27:35<Pedrosso>Great. I wouldn't like to be the one to do it but if nobody else would at the moment, then perhaps
00:32:43<mgrandi>Poke me if I forget but I'll spend some time tonight documenting it and put it on GitHub
00:34:29tbc1887 quits [Client Quit]
00:35:14kiryu joins
00:35:14kiryu quits [Changing host]
00:35:14kiryu (kiryu) joins
00:39:11tbc1887 (tbc1887) joins
01:15:14pabs quits [Remote host closed the connection]
01:29:49Naruyoko5 joins
01:31:50Naruyoko quits [Ping timeout: 240 seconds]
01:48:28<flashfire42>https://en.wikipedia.org/wiki/Republic_of_Artsakh has just ceased to exist in the last few days
01:50:26<pokechu22>Fortunately thanks to gooshka we've been running https://artsakhlib.am/ for a while (but that site's also super slow and errors out if you run it faster)
01:50:39<pokechu22>I think we might have run some of their other sites too, but it'd be good to re-run them
02:18:53enim3n joins
02:21:32enim3n quits [Remote host closed the connection]
02:21:32qwertyasdfuiopghjkl quits [Remote host closed the connection]
02:31:58pabs (pabs) joins
04:02:54Hackerpcs quits [Quit: Hackerpcs]
04:06:38Hackerpcs (Hackerpcs) joins
04:37:24atphoenix_ quits [Remote host closed the connection]
04:38:08atphoenix_ (atphoenix) joins
05:25:24BlueMaxima quits [Read error: Connection reset by peer]
05:36:20qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
06:03:50nulldata quits [Ping timeout: 240 seconds]
06:07:31Arcorann (Arcorann) joins
06:09:16nulldata (nulldata) joins
06:13:52Island quits [Read error: Connection reset by peer]
06:42:18c3manu (c3manu) joins
06:56:12sec^nd quits [Remote host closed the connection]
06:58:07sec^nd (second) joins
07:27:01<@arkiver>i'm looking into a project for hardware info since they have crazy strict rate limits
07:30:17c3manu quits [Remote host closed the connection]
07:30:46<@arkiver>if there are any "official" sites or youtube channels of https://en.wikipedia.org/wiki/Republic_of_Artsakh we should archive them in #archivebot (sites) and #down-the-tube (youtube)
07:32:05<flashfire42>https://www.spyur.am/
07:34:38<fireonlive>arkiver: hardware info?
07:42:59<@arkiver>fireonlive: went read only on january 1, see deathwatch
07:43:39<fireonlive>ah! thanks
07:43:51<fireonlive>i went google instead for some reason -_-
07:44:01<fireonlive>i of all people should know the wiki :D
07:50:47DogsRNice quits [Read error: Connection reset by peer]
07:51:43<pokechu22>https://www.spyur.am seems to have strict cloudflare unfortunately :/
07:52:02<pokechu22>(though it also sounds like it's Armenia in general, not Artsakh)
07:52:26rdosus joins
08:29:15rdosus quits [Client Quit]
08:41:33Megame (Megame) joins
08:52:03<fireonlive>manu|m & others here: looks like c3 is going to be deleting a number of matrix channels for the event (via the irc bridge to #37c3-hall-1 and others)
08:52:19<fireonlive>unsure if there's a way to save matrix channels & its attachments/threads/etc or ?
08:52:33<fireonlive>message: <admintechnicaladministrationnoc> PSA: This channel is a candidate for deletion. If you think this is a mistake, please let us know by replying to this message. Otherwise we are going to delete the channel in a few days. Thanks for using the matrix event chat, we are happy to hear your feedback:
08:52:33<fireonlive>https://events.ccc.de/congress/2023/hub/wiki/Feedback/
08:53:14<fireonlive>account for that seems to be @admin:events.ccc.de
08:53:19<fireonlive>(/whois)
09:02:51<fireonlive>(just in bed, but just got a ping that they're doing this for data privacy reasons, before we rush into this)
09:03:07<fireonlive>(will respond/ask qs tomorrow)
09:17:53<@arkiver>JAA: what directory contained the bulk of the size of archive.mozilla.org from your recent scan?
09:18:04<@arkiver>(CC Ryz )
10:00:01Bleo18260 quits [Client Quit]
10:01:21Bleo18260 joins
10:17:52qwertyasdfuiopghjkl quits [Remote host closed the connection]
10:46:31qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
11:30:25eroc1990 quits [Read error: Connection reset by peer]
11:32:24eroc1990 (eroc1990) joins
12:03:22bocci (bocci) joins
12:35:20le0n quits [Ping timeout: 240 seconds]
12:40:41le0n (le0n) joins
12:46:15Arcorann quits [Ping timeout: 272 seconds]
13:11:27bocci quits [Remote host closed the connection]
13:16:16bocci (bocci) joins
13:36:34kiryu_ joins
13:40:05kiryu quits [Ping timeout: 272 seconds]
13:41:21kiryu_ quits [Ping timeout: 272 seconds]
13:51:47jacksonchen666 is now known as RJHacker67335
13:51:52jacksonchen666 (jacksonchen666) joins
13:52:33jacksonchen666 quits [Client Quit]
13:53:21RJHacker67335 quits [Ping timeout: 255 seconds]
14:16:49Naruyoko5 quits [Ping timeout: 272 seconds]
14:18:11qwertyasdfuiopghjkl quits [Remote host closed the connection]
14:53:46qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
14:58:56BenjaminKrausseDB joins
15:03:10<BenjaminKrausseDB>Hi all, I'm trying to get an old download from the Microsoft Download Center, which no longer seems to be available. I stumbled upon this page (https://wiki.archiveteam.org/index.php/Microsoft_Download_Center) which states that everything was archived. I found the file I'm looking for in the index (msxml6_SDK.msi), and the way I understand it, that
15:03:10<BenjaminKrausseDB>file should be findable in https://archive.org/details/archiveteam_microsoft_download?sort=title . However, I am completely confused as to how to find the file there. It seems to me that a number of files are bunched together into large downloads, but I can't figure out for the life of me in which one of those large downloads the file I'm looking
15:03:11<BenjaminKrausseDB>for is located. Is there any documentation or something that I'm missing?
15:06:26<@Sanqui>BenjaminKrausseDB: You probably want to download the index https://archive.org/details/microsoft_download_center_html_index_2020-08 which will tell you which warc contains which URL (file), and then use something like pywb to replay the warc and extract the file.
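A minimal sketch of the extraction step Sanqui describes, using the warcio library rather than a full pywb replay (an assumption; either works). The WARC filename and target URL are placeholders to be substituted with the values found in the index:

```python
# Pull a single file out of a WARC with warcio (pip install warcio).
# The WARC filename and TARGET URL below are placeholders -- substitute
# the values the HTML index points you to.
from warcio.archiveiterator import ArchiveIterator

TARGET = "https://download.microsoft.com/.../msxml6_SDK.msi"  # placeholder

with open("example.warc.gz", "rb") as stream:  # gzip handled transparently
    for record in ArchiveIterator(stream):
        if (record.rec_type == "response"
                and record.rec_headers.get_header("WARC-Target-URI") == TARGET):
            with open("msxml6_SDK.msi", "wb") as out:
                out.write(record.content_stream().read())
            break
```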
15:11:07<BenjaminKrausseDB>Thanks for the link, I found the file I'm looking for in there, I'm just not sure where to go from there. Or is it the ID I'm looking for?
15:12:12<BenjaminKrausseDB>Essentially this is what I found:
15:12:22<BenjaminKrausseDB>~~~<h3 id="3988"><a href="#3988">•</a>Microsoft Core XML Services (MSXML) 6.0 </h3><p>MSXML 6.0 (MSXML6) has improved reliability, security, conformance with the XML 1.0 and XML Schema 1.0 W3C Recommendations, and compatibility with System.Xml 2.0.</p>
15:12:22<BenjaminKrausseDB>href="https://web.archive.org/web/20200801/https://www.microsoft.com/en-us/download/details.aspx?id=3988">Original page</a>)</p>
15:12:23<BenjaminKrausseDB>href="https://web.archive.org/web/20200801/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6.msi">msxml6.msi</a> (1.5MB)</p>
15:12:23<BenjaminKrausseDB>href="https://web.archive.org/web/20200801/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi">msxml6_ia64.msi</a> (3.6MB)</p>~~~
15:13:11<@Sanqui>Those archive.org links seem to work for me and start a download
15:13:18<@Sanqui>so I guess that's exactly what you want!
15:13:45BenjaminKrausseDB2 joins
15:14:12<@Sanqui>BenjaminKrausseDB2: <@Sanqui> Those archive.org links seem to work for me and start a download
15:14:12<@Sanqui><@Sanqui> so I guess that's exactly what you want!
15:14:51<BenjaminKrausseDB>OK, weird, they're not working here. I'll try those links on a different device...
15:16:15<@Sanqui>if the download doesn't start, try putting "id_" after the timestamp in the url, as such:
15:16:20<@Sanqui>https://web.archive.org/web/20200803205234id_/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi
15:16:27<@Sanqui>might have better compatibility
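In script form, the same trick looks like this (the URL is the one from the log; the requests library is assumed available):

```python
# Fetch the raw, unmodified capture from the Wayback Machine: the "id_"
# ("identity") flag after the timestamp disables the replay toolbar and
# URL rewriting, so you get the original bytes back.
import requests

url = ("https://web.archive.org/web/20200803205234id_/"
       "https://download.microsoft.com/download/2/e/0/"
       "2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi")

resp = requests.get(url, timeout=120)
resp.raise_for_status()
with open("msxml6_ia64.msi", "wb") as f:
    f.write(resp.content)
```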
15:18:18<nicolas17>yes that goes directly to a 3MB binary file
15:18:51<BenjaminKrausseDB>OK, it worked on my phone. I suspect my work network is blocking something (although usually it says something, not sure what my IT department pulled off this time). Thanks for the help!
15:19:22<@Sanqui>No prob, good luck getting that Itanic working!
15:19:50nic9070 quits [Ping timeout: 240 seconds]
15:20:05nic9070 (nic) joins
15:20:49bocci_ joins
15:23:57bocci quits [Ping timeout: 272 seconds]
15:26:08<BenjaminKrausseDB>Thanks, I think I'll need the luck the way this has been going up until now '=D
15:26:25BenjaminKrausseDB2 quits [Ping timeout: 265 seconds]
15:29:27Naruyoko5 joins
15:31:00treora quits [Remote host closed the connection]
15:31:02treora joins
15:31:09treora quits [Remote host closed the connection]
15:31:11treora joins
15:40:01<BenjaminKrausseDB>Got it working! Thanks for the help and all the work you guys do!
15:40:33BenjaminKrausseDB2 joins
15:40:47BenjaminKrausseDB quits [Remote host closed the connection]
15:45:30BenjaminKrausseDB2 quits [Remote host closed the connection]
15:58:20bocci_ quits [Ping timeout: 240 seconds]
15:59:14bocci_ joins
16:03:23Deewiant quits [Remote host closed the connection]
16:20:19riku quits [Ping timeout: 272 seconds]
16:21:26<fireonlive>^_^
16:51:14Deewiant (Deewiant) joins
17:10:24JayEmbee quits [Quit: WeeChat 2.3]
17:12:53bocci_ quits [Read error: Connection reset by peer]
17:14:32bocci_ joins
17:15:20riku (riku) joins
17:19:24Kitty (Kitty) joins
17:20:29bocci_ quits [Ping timeout: 272 seconds]
17:21:15bocci_ joins
17:27:42treora quits [Remote host closed the connection]
17:27:45treora joins
17:42:17treora quits [Remote host closed the connection]
17:42:21treora joins
17:50:48poetav__ joins
17:53:20bocci_ quits [Ping timeout: 240 seconds]
17:53:42<h2ibot>FireonLive edited Deathwatch (+371, add bear.community): https://wiki.archiveteam.org/?diff=51457&oldid=51455
17:53:49<fireonlive>that was fast
17:54:23<fireonlive>luck of the cron
17:57:19bocci_ joins
17:57:42<h2ibot>FireonLive edited Current Projects (+78, add pastebin): https://wiki.archiveteam.org/?diff=51458&oldid=51407
17:57:43<h2ibot>FireonLive edited Pastebin (+24, DPoS): https://wiki.archiveteam.org/?diff=51459&oldid=47706
17:59:42<h2ibot>FireonLive edited Pastebin (+23, add CTA, make more secure): https://wiki.archiveteam.org/?diff=51460&oldid=51459
18:00:50poetav__ quits [Ping timeout: 240 seconds]
18:03:56Island joins
18:07:28JayEmbee (JayEmbee) joins
18:39:18IRC2DC joins
18:50:17<thuban>speaking of pastebin, i've noticed that the project code makes no attempt to extract outlinks from paste content. is that a deliberate choice?
18:55:47<fireonlive>hmmm. lots of spam there, but i think it's an older project so maybe not?
18:56:33<thuban>yeah, hence my uncertainty
18:58:43<fireonlive>arkiver?
19:00:06<thuban>could be a good source for links to filesharing projects (like mediafire or zippyshare) since it's often used as an agglomerator
19:01:28<thuban>(i know of at least one subreddit that bans download links, to avoid the attention of site admins, but tacitly encourages pastebins of same)
19:12:53<bocci_>speaking of hidden URLs, have projects ever made an effort to catch base64-encoded urls
19:14:07<bocci_>using rot13 or base64, some file sharing communities hide mega, mediafire URLs from bots that issue DMCA takedowns
19:15:33<nicolas17>I question if those particular links are the kind of thing we want to archive >.>
19:16:10Doranwen quits [Quit: bbl]
19:16:20<bocci_>sure
19:18:15<thuban>bocci_: no, afaik no projects have ever implemented that kind of filter-evasion matching
19:18:18<thuban>(there's some attempt to repair broken urls, but mainly for accidental syntax-mangling)
19:19:01<bocci_>thanks, i just wanted to know/make it known
19:19:56<bocci_>an example of a history of these encoded links being used:
19:19:57<bocci_>https://warosu.org/ic/thread/6960541#p6960541
19:20:16<thuban>nicolas17: it can be legit. i remember doing a bunch of those manually during the zippyshare project--they were video game mods from some forum crawl
19:28:09<fireonlive>!tell Doranwen do you have a wiki account?
19:28:09<eggdrop>[tell] ok, I'll tell Doranwen when they join next
19:28:43<fireonlive>ah yeah, base64 has been used a lot in /r/piracy wiki i think?
19:28:47<fireonlive>or some reddit wiki
19:29:22<bocci_>for the record, the strings aren't random or encrypted
19:29:35<bocci_>a base64-encoded https link always starts with aHR0cHM6Ly
19:30:11<bocci_>and mediafire links aren't hard to spot once you memorize the pattern
19:30:25<bocci_>https://www.mediafire.com/file/not-real
19:30:31<bocci_>https://www.mediafire.com/file/some-file
19:30:42<bocci_>aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg==
19:30:47<bocci_>aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL3NvbWUtZmlsZQo=
19:31:03<fireonlive>i guess you'd want to look for aHR0cHM6Ly8 and aHR0cDovLw (https:// and http://)
19:31:10<fireonlive>oh no 8
19:31:25<fireonlive>interesting idea though i like it
19:33:10<thuban>would miss protocol-stripped links, but you'd have to get really aggressively heuristic to catch the general case, soz
19:33:16<thuban>interesting, i concur
19:35:40<bocci_>i think you can find protocol-stripped links automatically without some crazy heuristic
19:35:47<bocci_>if you limit yourself to some hosts
19:36:32<bocci_>d3d3Lm1lZGlhZmlyZS5jb20K = www.mediafire.com
19:37:08<bocci_>it's such a specific string, you wouldn't have any false positives
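A rough sketch of the detection idea bocci_ and fireonlive are discussing (this is not an existing project filter): find runs of base64-alphabet characters, try decoding them, and keep anything that comes out looking like a URL. The example string is the one pasted above:

```python
# Find base64-encoded URLs hidden in paste text: locate runs of base64
# alphabet characters, decode them, and keep results that start with
# http:// or https://. A sketch, not existing project code.
import base64
import re

B64_RUN = re.compile(r"[A-Za-z0-9+/]{12,}={0,2}")

def decode_candidate(token: str) -> str | None:
    # Pad to a multiple of 4 so b64decode accepts truncated runs.
    padded = token + "=" * (-len(token) % 4)
    try:
        text = base64.b64decode(padded).decode("utf-8").strip()
    except Exception:
        return None
    return text if text.lower().startswith(("http://", "https://")) else None

def find_hidden_urls(paste: str) -> list[str]:
    return [url for token in B64_RUN.findall(paste)
            if (url := decode_candidate(token)) is not None]

print(find_hidden_urls(
    "see aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg=="))
# -> ['https://www.mediafire.com/file/not-real']
```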
19:37:21<thuban>correct, but due to the way we backfeed discovered urls between projects, that could get awkward to maintain
19:38:27<bocci_>i have no idea about that
19:41:18<fireonlive>i suppose for pastebin itself someone could make something bespoke to scrape the warcs
19:48:07<thuban>fireonlive: someone has :P
19:51:24<thuban>by which i mean JAA's done a horrible one-liner a couple of times.
19:54:39<thuban>bocci_: basically, if a project discovers outlinks, it sends them to the general urls project (#//), which checks them against the list of site-specific projects and forwards them appropriately if there's a match
19:54:48<thuban>if every project were to discover obfuscated outlinks to a specific list of hosts, then every project would need the list of site-specific projects
19:55:42<thuban>and keeping an n:n system consistent is hell compared to 1:n
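As a toy illustration of the 1:n shape thuban describes (all names here are made up; the real backfeed plumbing is project infrastructure, not this): every project sends discovered URLs to one central router, and only the router needs to know the hostname-to-project mapping.

```python
# Toy 1:n dispatch: one central router holds the hostname -> project
# mapping; individual projects just forward everything to it. The
# mapping and project names are hypothetical.
from urllib.parse import urlparse

SITE_PROJECTS = {
    "www.mediafire.com": "mediafire",
    "mega.nz": "mega",
}
GENERAL_PROJECT = "urls"  # the catch-all (#//)

def route(url: str) -> str:
    host = urlparse(url).hostname or ""
    return SITE_PROJECTS.get(host, GENERAL_PROJECT)

assert route("https://www.mediafire.com/file/some-file") == "mediafire"
assert route("https://example.com/page") == "urls"
```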
19:55:43<fireonlive>ah :D
19:56:51<fireonlive>hmmmm. i guess you could use those 'indicators' for b64 http/https and do further local processing if found?
19:57:00<fireonlive>then ship it to urls as normal?
19:57:14<thuban>right
19:57:25<fireonlive>sounds fun :)
20:14:41<fireonlive>-+rss- Niklaus Wirth Passed Away: https://twitter.com/Bertrand_Meyer/status/1742613897675178347 https://news.ycombinator.com/item?id=38858012
20:14:42<eggdrop>nitter: https://nitter.net/Bertrand_Meyer/status/1742613897675178347
20:16:25<qwertyasdfuiopghjkl>You would also need to account for all the different possible capitalizations of http:// and https:// since that would change the base64
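(The decode-first sketch above sidesteps this, since it lowercases after decoding. If you instead want to match raw prefixes as fireonlive suggests, the fixed prefixes for every capitalization can be enumerated; only the first 4n//3 base64 characters of an n-byte input are independent of the bytes that follow. A sketch, not existing project code:)

```python
# Enumerate the unambiguous base64 prefixes for every capitalization of
# "http://" and "https://". Only the first len(data) * 4 // 3 output
# characters are independent of the bytes that follow, so truncate there.
import base64
from itertools import product

def safe_prefix(s: str) -> str:
    data = s.encode()
    return base64.b64encode(data).decode()[: len(data) * 4 // 3]

def case_variants(s: str):
    pools = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in s]
    return ("".join(p) for p in product(*pools))

prefixes = {safe_prefix(v) for scheme in ("http://", "https://")
            for v in case_variants(scheme)}
# e.g. safe_prefix("https://") == "aHR0cHM6Ly" -- the trailing 8 or 9
# depends on the next byte, hence the "oh no 8" above.
```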
20:38:49BlueMaxima joins
20:45:31c3manu (c3manu) joins
21:17:26<nicolas17>iOS 17.3 beta 2 was released today, and soon it was discovered that it caused iPhones with a certain feature enabled to boot-loop, so 3 hours later it was pulled from the update server
21:18:27<nicolas17>they *might* delete the actual files from the CDN too... sum of all variants is 239GB, is this too much? would it work on AB or urls?
21:18:29<nicolas17>JAA: ^
21:22:36Naruyoko5 quits [Remote host closed the connection]
21:22:57Naruyoko5 joins
21:23:59<bocci_>dumb question: what's wrong with just downloading the files and uploading to an archive.org collection if you wish to archive them
21:24:47<nicolas17>I could, and I have done that for files that were *already* deleted but I recovered from elsewhere
21:25:07<nicolas17>but then it won't work on WBM
21:25:22<bocci_>oh
21:26:03<bocci_>i've felt wrong for using the WBM for large files
21:26:04<nicolas17>and with my Internet it would take 20 hours to upload, but upload speeds *to IA* are usually worse
21:27:35<bocci_>i kinda had the sense that directly hitting images/files on the WBM was an unintended effect of saving web pages
21:28:05<bocci_>wayback machine is for webpages
21:28:12<bocci_>i think im wrong
21:28:21<nicolas17>idk, that's why I'm asking first :P
21:35:25<thuban>bocci_: nothing wrong with having files in the wbm--in fact it's good, because it's more authoritative _and_ more discoverable than just having them somewhere on archive.org
21:35:30<thuban>(if you find a link somewhere and it's dead, it's a lot easier to plug the url into the wbm than to search around and maybe find a relevant item and maybe find the file within the item and hope it's correct)
21:35:34<thuban>buuut there's a lot of duct tape involved, so idk how large is too large either
21:36:37<nicolas17>it's 34 files from 6363 MiB to 7756 MiB
21:37:28<bocci_>in total or each?
21:38:02<nicolas17>as I said total is 239GB x_x
21:39:00<nicolas17>MiB|url: https://paste.debian.net/1302977/
21:43:38<pokechu22>nicolas17: doing it via AB is probably fine
21:44:25<pokechu22>just got to make sure it ends up on firepipe (1.44 TiB free) or addax (524 GiB free) per http://archivebot.com/pipelines
21:45:54<pokechu22>an !ao < list of https://transfer.archivete.am/inline/zkuP2/ios_17.3_beta_2_cdn_urls.txt (which deliberately includes that paste at the top as a small file) should be fine, I'll run it unless you've got a different plan
22:01:49Megame quits [Client Quit]
22:22:02jacksonchen666 (jacksonchen666) joins
22:35:27c3manu quits [Remote host closed the connection]
22:50:41neggles quits [Quit: bye friends - ZNC - https://znc.in]
22:52:55neggles (neggles) joins
22:56:59simon816 quits [Remote host closed the connection]
22:59:36<@JAA>arkiver: Re archive.mozilla.org, I don't remember, but I believe I posted the link to the full JSONL scan output here some weeks ago.
23:00:42<audrooku|m>Is jsonl the same as ndjson?
23:02:26<@JAA>thuban: Can confirm, have written such horrible one-liners. 60% of the time, they work every time!
23:03:05<@JAA>audrooku|m: Yes
23:04:33<@JAA>Also referred to as 'JSON Lines' and some other variations. But .jsonl is the common file extension, and application/jsonl is the proposed media type.
23:05:13<@JAA>Also 'Line-Delimited JSON', which has absolutely no potential of confusion with the entirely unrelated JSON-LD.
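Concretely, JSON Lines is just one complete JSON value per line, so a huge file can be processed line by line instead of parsed whole. The fields below are illustrative, not the actual schema of the Mozilla scan output:

```python
# JSON Lines / ndjson: one JSON value per line. Read and parse lazily,
# one record at a time. Field names here are made up for illustration.
import json

sample = (
    '{"url": "https://archive.mozilla.org/pub/a", "size": 123}\n'
    '{"url": "https://archive.mozilla.org/pub/b", "size": 456}\n'
)

for line in sample.splitlines():
    record = json.loads(line)
    print(record["url"], record["size"])
```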
23:05:50Hackerpcs quits [Ping timeout: 240 seconds]
23:06:14<@JAA>nicolas17, pokechu22: Yes, fine with AB. Large pipeline's a good idea, but if all pipelines are full, !ao < should end up on firepipe-ao anyway (unless that's full as well, didn't check).
23:06:41<@JAA>(Of course, firepipe-ao won't run jobs queued with --pipeline.)
23:07:20<thuban><@JAA> arkiver: [...] I believe I posted the link to the full JSONL scan output here some weeks ago.
23:07:21<pokechu22>It looked good as of an hour ago (I also see you got rid of addax-ao, which I guess makes sense because firepipe-ao receives jobs much faster)
23:07:23<thuban>https://transfer.archivete.am/a0mjU/archive.mozilla.org-files.jsonl.zst
23:07:29<thuban>(https://hackint.logs.kiska.pw/archiveteam-bs/20231118#c390573)
23:08:56<@JAA>pokechu22: Yeah, that's why. jap-addax-ao was taking a minute or more to dequeue a job, just horrendous.
23:09:34Hackerpcs (Hackerpcs) joins
23:10:08<pokechu22>It's running (ab job ew2dbtuft08uz2xe0tf4lhlcv)
23:12:54<@JAA>:-)
23:13:26<fireonlive>^_^
23:13:55<thuban>JAA, any thoughts on the wiki changes suggested in #//?
23:27:34<nulldata>https://www.polygon.com/24024266/kim-kardashian-mobile-game-shutting-down-glu-mobile
23:30:50simon816 (simon816) joins
23:30:51<nulldata>https://www.eurogamer.net/stray-souls-developer-shuts-down-following-publishers-closure-cyberbullying-and-poor-sales
23:31:40<nulldata>Doesn't look like Stray Souls has a website anymore, but they do have a Twitter if someone could throw it in AB. https://twitter.com/jukaistudio
23:31:41<eggdrop>nitter: https://nitter.net/jukaistudio
23:34:28simon816 quits [Client Quit]
23:35:42<fireonlive>added it to next on the pad for when one of the two active finish
23:44:39<nicolas17>https://developer.apple.com/documentation/ios-ipados-release-notes/ios-ipados-17_3-release-notes now finally acknowledging the issue
23:47:24<fireonlive>archivebotted
23:50:50simon816 (simon816) joins