00:27:29jacksonchen666 is now known as RJHacker1205
00:27:33jacksonchen666 (jacksonchen666) joins
00:30:26RJHacker1205 quits [Ping timeout: 245 seconds]
00:48:54Arcorann (Arcorann) joins
01:15:57<pabs>AntoninDelFabbro|m: which website, what are you trying to get? I tend to use different things for different purposes. for eg: googler/ddgr for site: search engine queries. curl/wget for downloads. pup for HTML parsing/querying. jq for JSON querying
01:42:21Arcorann quits [Ping timeout: 265 seconds]
01:44:42Arcorann (Arcorann) joins
01:44:53Megame quits [Ping timeout: 252 seconds]
02:05:49erkinalp joins
02:06:57erkinalp quits [Remote host closed the connection]
02:19:39<h2ibot>Systwi uploaded File:Duck Hunt (World)-0--twitter-5.png (Mr. Peepers holding the Twitter bird,…): https://wiki.archiveteam.org/?title=File%3ADuck%20Hunt%20%28World%29-0--twitter-5.png
02:27:41<h2ibot>Systwi edited Twitter (+177, /* Vital Signs */ Added meme and serious caption.): https://wiki.archiveteam.org/?diff=50593&oldid=50591
02:35:29<fireonlive>systwi: 😃
02:39:43<h2ibot>Systwi edited Site exploration (+398, /* Twitter */ Mentioned Nitter and Twitter's…): https://wiki.archiveteam.org/?diff=50594&oldid=50492
02:40:02<systwi>fireonlive: :-D
03:15:20Megame (Megame) joins
03:15:40parfait (kdqep) joins
03:26:14mindstrut quits [Read error: Connection reset by peer]
03:26:41mindstrut joins
03:51:49krvme joins
03:52:01fluke quits [Ping timeout: 258 seconds]
03:52:12fluke joins
03:55:45Krume quits [Ping timeout: 265 seconds]
04:17:31DogsRNice quits [Read error: Connection reset by peer]
04:20:53dazld quits [Ping timeout: 265 seconds]
04:44:41erkinalp joins
04:54:05<erkinalp>wowturkey is down; it's not known whether this is temporary and it will be restored once more, or permanent as in finally closed
04:54:40dazld (dazld) joins
04:54:48<erkinalp>let's leave the bot running in hope it returns once more
04:55:02<erkinalp>the announced date was august 31
05:01:28IDK (IDK) joins
05:02:15<pabs>nicolas17: anything to save for this? https://www.volkerkrause.eu/2023/08/26/kde-jenkins-retirement-progress.html
05:03:50<nicolas17>pabs: I doubt it because most data there was ephemeral in the first place, eg. there's projects that do a daily build and only the last 5 binaries are kept
05:04:02BigBrain quits [Remote host closed the connection]
05:04:04<pabs>and what about the phabricator?
05:04:23<erkinalp>phabricator is a bugtracker, that's significant
05:04:24BigBrain (bigbrain) joins
05:04:41<erkinalp>tickets may have good things
05:05:34<nicolas17>we'll probably turn it into static pages somehow
05:05:55<nicolas17>I'm not sure how easy it is to archive, I think there's like, JS-backed "load more comments" stuff?
05:05:57pabs recommends an AB job, then download the static files :)
05:06:30<pabs>I did a phabricator recently, apart from the large amount of ignores I think it worked ok
05:07:38<erkinalp>we'd have to do the same with missing wowturkey viewtopic pages with p=### links
05:08:06<erkinalp>corresponding t=####&start=### ones have already been crawled
05:08:27<nicolas17>at one point we considered moving issues from phabricator to gitlab and it was messy because tickets can have multiple tags/projects that they belong to, while gitlab issues belong to *one* project
05:09:15<nicolas17>so we would need to check case by case and make a list of "if a ticket has tag X and tag Y, put it in repo Y"
05:09:58<pabs>AB job then static seems better
05:10:18<nicolas17>well yeah, this was *early* in the gitlab move when a lot of tickets would still be active
05:10:25BigBrain quits [Remote host closed the connection]
05:11:38<nicolas17>by now I guess a lot was closed, or stopped mattering, or was still active and someone moved it manually
05:12:31<flashfire42>Ok should I focus on webs or orange today? both have close cut off dates. Or do I say the hell with the both of them and continue with the aussie ISPs that have technically passed their shutdown date and are still up?
05:14:39<erkinalp>"September 1: wowTURKEY[IA•Wcite•.today•MemWeb], a large Turkish photo sharing forum[23]" september 1 → august 27
05:15:00<erkinalp>don't kill the archivebot crawler tho
05:15:12<erkinalp>it still crawls previously failed external links
05:17:18<erkinalp>if we had started this crawl one day before, we would have the full archive today...
05:17:44dumbgoy_ quits [Ping timeout: 252 seconds]
05:17:45<flashfire42>Alas that is the joys of web archival
05:17:50<flashfire42>things are lost every day my friend
05:18:01hitgrr8 joins
05:18:07<flashfire42>and it sucks. it does. but we do what we can
05:18:15<erkinalp>87% is better than nothing
05:18:38<erkinalp>(known item count is ~9.4M, we got 8,338,246)
05:23:32<pabs>https://www.science.org/content/article/government-seizure-nicaraguan-university-blow-science-researchers-say
05:23:49<erkinalp>300ms is too short for wowturkey, the server's own delay was about 500ms even when it was up
05:26:02BigBrain (bigbrain) joins
05:36:25<pokechu22>flashfire42: I should probably do an !a < list job on orange - webs might be better to focus on. On the other hand webs has the stupid calendars that make things a mess :|
06:37:10<flashfire42>orange has about 4 different subdomains and some of them don't even resolve for me but do for others. Webs is not a set-and-forget thing, which is really what I am aiming for, because the calendars are so fucking broken
05:38:46sec^nd quits [Ping timeout: 245 seconds]
05:39:20sec^nd (second) joins
06:04:48dazld quits [Ping timeout: 265 seconds]
06:06:42dazld (dazld) joins
06:15:01sec^nd quits [Ping timeout: 245 seconds]
06:28:06sec^nd (second) joins
06:43:31<pokechu22>Can I get a list of those different subdomains? (Lists of individual sites would be useful too but I have some ideas of how to get those once I know the starting points)
06:46:51<flashfire42>pagesperso-orange.fr
06:46:51<flashfire42>monsite-orange.fr
06:46:53<flashfire42>those are the main 2
06:47:06<flashfire42>images are hosted on cdn.woopic.com
06:50:26krvme quits [Read error: Connection reset by peer]
06:56:56<DigitalDragons>a few individual sites: https://crawlyproject.digitaldragon.dev/cds/lists/fr/pagesperso-orange/
07:04:08<DigitalDragons>(ignore the .txt at the end of everything)
07:09:23Unholy2361316618085 quits [Ping timeout: 252 seconds]
07:09:33<AntoninDelFabbro|m>pabs: I want to download https://annuaire-pp.orange.fr/accueil, but thanks for your help! :D
07:11:42<pabs>a good option for that is open it in your browser, open dev tools, click on all the things on the site, then save all the requests as a .har and then AB all the URLs output by this shell oneliner:
07:11:44<pabs>for f in *.har ; do jq -r '.log.entries[].request.url' < "$f" ; done | sort -u
07:12:12<pabs>ah, better open dev tools before loading the page, woops
07:12:49<pabs>there are some browser based crawler things on the wiki somewhere too
07:13:15<pabs>but they may not work if you need to interact with the site
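The jq one-liner above can also be done without jq. This is a minimal Python sketch of the same idea, assuming HAR 1.2 files saved from the browser's dev tools (the `*.har` glob pattern is just an example):

```python
import glob
import json

def har_urls(pattern="*.har"):
    """Collect unique request URLs from HAR files, like the jq one-liner above."""
    urls = set()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            har = json.load(f)
        # HAR 1.2 layout: log.entries[].request.url holds each captured request URL
        for entry in har.get("log", {}).get("entries", []):
            urls.add(entry["request"]["url"])
    return sorted(urls)
```

The sorted, deduplicated output can then be fed to AB as a URL list.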
07:14:27Megame quits [Client Quit]
07:20:28<AntoninDelFabbro|m>Gold! I just woke up, but I'm impatient to try this asap! Thank you!
07:26:26Perk quits [Ping timeout: 252 seconds]
07:27:16Perk joins
07:31:24<erkinalp>qyxojzh|m: wowturkey definitively down
07:42:23nulldata quits [Ping timeout: 252 seconds]
07:45:36nulldata (nulldata) joins
07:52:01miki_57 joins
07:52:32miki_57 quits [Max SendQ exceeded]
08:03:50nulldata quits [Ping timeout: 252 seconds]
08:04:18<erkinalp>JAA: wowturkey definitively down, as of 0400UTC today
08:06:37nulldata (nulldata) joins
08:10:31<thuban>AntoninDelFabbro|m: that's a good way to capture data you can click through manually, but if the amount of navigation required is very large, i personally prefer to write a short script.
08:11:36<thuban>i have done so for annuaire-pp.orange.fr (and in the process, i believe, discovered more results than are shown in the browser) and will dump results tomorrow
08:16:16MetaNova quits [Ping timeout: 265 seconds]
08:20:56<AntoninDelFabbro|m>Awesome! Haha, well you saved me lot of time, thanks ;)
08:21:56MetaNova (MetaNova) joins
08:22:57Island quits [Read error: Connection reset by peer]
08:26:10<thuban>AntoninDelFabbro|m: you're welcome!
08:32:08<thuban>also, uh, can someone remind me what the status is on orange isp hosting in general? are we still just dumping stuff in archivebot? because there are tens of thousands of these
08:34:19<thuban>(i was going to add 'and some of them require javascript', but based on my spot-checking they're all in the weird 'put everything in the html, but don't actually display it until the js loads' idiom, so i think archivebot would actually be fine in that respect)
08:36:01aGerman quits [Quit: The Lounge - https://thelounge.chat]
08:39:13aGerman (aGerman) joins
08:47:26<erkinalp>qyxojzh|m: JAA: arkiver: one of wowturkey's former mods is about to ask the owner to buy and resurrect wowturkey.com
08:47:37<erkinalp>we might get a last chance revive
08:58:11<flashfire42>thuban um everything into archivebot unless you design some scripts, because it's about a week away from going bye bye and we have like 3 or 4 warrior projects on the go and fuck all ingestion to IA right now
09:09:02Exorcism (exorcism) joins
09:26:09erkinalp quits [Remote host closed the connection]
09:28:11erkinalp joins
09:44:11<h2ibot>Bzc6p edited Demotivalo.net (+37, /* Sister sites */ Update stati): https://wiki.archiveteam.org/?diff=50595&oldid=47826
10:00:01railen63 quits [Remote host closed the connection]
10:00:18railen63 joins
10:03:27Exorcism quits [Client Quit]
10:04:42Exorcism (exorcism) joins
10:18:01jacksonchen666 is now known as RJHacker92085
10:18:05jacksonchen666 (jacksonchen666) joins
10:20:51RJHacker92085 quits [Ping timeout: 245 seconds]
10:31:58BlueMaxima quits [Read error: Connection reset by peer]
10:46:05systwi quits [Ping timeout: 252 seconds]
10:52:33systwi (systwi) joins
11:25:14<erkinalp>wowturkey is down as of now
11:33:44railen63 quits [Remote host closed the connection]
11:36:59railen63 joins
12:04:36BigBrain quits [Ping timeout: 245 seconds]
12:06:52BigBrain (bigbrain) joins
12:19:27<pabs>this site has 6TB of FLACs for bluegrass music: https://bluegrassarchive.com/ (frameset for https://gdarchive.net/Public/Bluegrass/contents.htm)
12:20:21<pabs>would be nice to grab eventually, but seems a bit big for AB, especially with the current IA upload limits
12:32:43<h2ibot>PaulWise edited Bugzilla (+1073, more from BZ site…): https://wiki.archiveteam.org/?diff=50596&oldid=50588
12:33:44<h2ibot>PaulWise edited Bugzilla (-93, remove accidentally added done ones): https://wiki.archiveteam.org/?diff=50597&oldid=50596
12:43:11hitgrr8 quits [Client Quit]
12:43:45<h2ibot>PaulWise edited Deathwatch (+295, Eclipse Wiki shutdown): https://wiki.archiveteam.org/?diff=50598&oldid=50584
12:53:48<h2ibot>PaulWise edited Bugzilla (+0, Eclipse Bugzilla shutdown, AB in progress:…): https://wiki.archiveteam.org/?diff=50599&oldid=50597
13:03:41systwi quits [Client Quit]
13:05:41systwi (systwi) joins
13:09:22jacksonchen666 quits [Client Quit]
13:33:27thenes quits [Quit: WeeChat 4.0.0]
13:34:56<h2ibot>Exorcism edited Orain (-6): https://wiki.archiveteam.org/?diff=50600&oldid=44412
13:46:06<@arkiver>pabs: it's maybe fine to put in AB when the current problems at IA are fixed
13:46:31<pabs>ok, wasn't sure if AB could handle that volume either
13:46:44<pabs>thanks
13:48:26<@arkiver>well JAA is the expert on that
13:50:15dazld quits [Ping timeout: 265 seconds]
13:55:00<h2ibot>Exorcism edited Nupedia (+10): https://wiki.archiveteam.org/?diff=50601&oldid=28721
14:01:20Arcorann quits [Ping timeout: 252 seconds]
14:03:49dazld (dazld) joins
14:05:33<@JAA>erkinalp: Ugh. Yeah, let's hope it's resurrected.
14:07:43<@JAA>pabs, arkiver: AB doesn't care much about data size as long as there aren't huge files in it. The other limiting factor is number of URLs, but until you go over 100M, that's not usually a problem either.
14:23:02ymgve joins
14:28:10<erkinalp>JAA: seems no hope
14:50:13mgrytbak quits [Quit: Ping timeout (120 seconds)]
14:50:28mgrytbak joins
14:51:51<erkinalp>why are archivebot downloads so slow?
14:52:21<@JAA>Do you mean downloads of ArchiveBot data from the Internet Archive?
15:15:17<erkinalp>no, archivebot data downloads from archive.fart.website
15:15:40<erkinalp>i'm getting isdn speeds of download currently
15:16:02<erkinalp>it can't be due to my link speeds either
15:16:15<erkinalp>i happen to have 70mbps down, 10mbps up
15:16:30<@JAA>The AB viewer is just an index of the data on IA.
15:16:38<@JAA>The links go to IA.
15:17:04<@JAA>And yeah, downloads from IA are notoriously slow, especially if you aren't near the Bay Area.
15:18:08<erkinalp>it isn't that slow normally
15:18:21<erkinalp>it was usually a few mbps
15:18:40<erkinalp>i could get good dsl speeds of download
15:18:47<erkinalp>not dialup or isdn speeds
15:19:17mgrytbak quits [Client Quit]
15:19:27mgrytbak joins
15:21:04<@JAA>It varies depending on IA load and which server the data is on.
15:23:03dazld quits [Ping timeout: 265 seconds]
15:25:08DogsRNice joins
15:31:15Dango360 quits [Read error: Connection reset by peer]
15:34:03Dango360 (Dango360) joins
15:37:36dazld (dazld) joins
15:44:52mgrytbak quits [Client Quit]
15:45:01mgrytbak joins
15:49:02mgrytbak quits [Client Quit]
15:49:12mgrytbak joins
16:03:15petrichor quits [Client Quit]
16:04:15petrichor (petrichor) joins
16:25:24lflare quits [Quit: Bye]
16:43:27dumbgoy_ joins
17:24:38aninternettroll_ (aninternettroll) joins
17:24:46aninternettroll quits [Read error: Connection reset by peer]
17:25:02aninternettroll_ is now known as aninternettroll
17:27:41<erkinalp>IA downloads are now down to dialup speeds
17:30:31<@JAA>Not surprising. IA is pretty busy recently, and it slows various things to a crawl.
17:31:32<@JAA>Even IA-internal things are slow. One particular item I was monitoring took over 6 days to move 43 GB around internally.
17:31:43<fireonlive>ooof
17:32:03<@JAA>(Move it from S3 to item server, checksums, and mirror to backup server.)
17:40:57erkinalp quits [Remote host closed the connection]
17:41:48erkinalp joins
17:41:59<@arkiver>i'm planning on dusting off wikis-grab for the upcoming deletions of wikis
17:42:09<@arkiver>though those are also largely covered already by wikiteam dumps i believe
17:42:21<@arkiver>the wikis-grab would more be a general method of archiving wikis
17:43:00<@arkiver>project is also coming for ZOWA
17:44:36<@arkiver>any idea for a channel for zowa.app ?
17:45:05<@arkiver>flashfire42: do you know if we have the orange ISP hosting stuff fully covered with AB?
17:48:43<pokechu22>arkiver: have you seen #wikibot?
17:48:58<pokechu22>ah, you're already in that channel
17:49:11Darken (Darken) joins
17:49:25<@arkiver>pokechu22: yeah
17:49:42<@arkiver>i think it's good to have both dumps from there and from a project creating WARCs
17:49:45<pokechu22>I haven't seen wikis-grab before - does it try to do everything wikiteam does, or is it mostly focused on saving the current revision of every page?
17:49:53<pokechu22>Yeah, WARCs are good
17:50:03<@arkiver>current revision
17:50:21<@arkiver>may be good to allow wikiteam higher priority with the dumps than WARCs, since it's more complete
17:50:36<@arkiver>but after that we should attempt to create WARCs as well
17:51:03<pokechu22>Saving the current revision and maybe all pages on the history tab (but not the revisions themselves - just the history list for attribution) probably is enough for WARCs
17:51:36<@arkiver>yeah
17:51:52<@JAA>Love it!
17:51:55<@arkiver>it's largely for URL preservation, so it's in the wayback machine and easily browsable
17:52:12<@arkiver>the dumps can be used to restore a wiki with (right?), but for browsing WARCs are better
17:52:14<@arkiver>JAA: :)
17:53:28<@JAA>Yep, that's accurate.
17:53:44<@JAA>The dumps are entirely unusable for the average person.
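The URL set pokechu22 describes (current revision of every page plus the history listing for attribution, but not the old revisions themselves) can be sketched as follows. This assumes a standard MediaWiki `index.php` layout; the base URL and page titles are placeholders:

```python
from urllib.parse import quote

def wiki_page_urls(base, titles):
    """For each page, yield the current-revision URL and the history listing
    (revision list only, for attribution), per the standard MediaWiki layout."""
    urls = []
    for title in titles:
        # MediaWiki uses underscores in titles; percent-encode the rest
        t = quote(title.replace(" ", "_"), safe=":/")
        urls.append(f"{base}/index.php?title={t}")                  # current revision
        urls.append(f"{base}/index.php?title={t}&action=history")   # history list
    return urls
```

A list like this could then be fed to a WARC-producing grab, with the full-history wikiteam dump made separately.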
17:54:10<@arkiver>and of course outlinks for #// :)
17:55:16<nstrom|m><arkiver> "any idea for a channel for zowa..." <- zowch
17:55:57<flashfire42|m>Sorry arkiver, it's 4am here and you are lucky I snap awake for random reasons. Orange is far from complete in archivebot. I've been launching as many jobs as I can manage but it's like fighting a fire with a kid's water bucket. I'll get a sampling but not all of it
17:56:50<flashfire42|m>I will continue to throw in as much as possible during the next week, but we won't get it all, I can say that with certainty. Not unless we get a stay of execution for another month or 2
17:57:18<flashfire42|m>Hopefully that info is helpful it is time for me to head back to sleep for another 2 hours.
17:58:22<pokechu22>I'll try to do an !a < list job for it too
17:58:32<pokechu22>The deadline for webs is sooner though
17:58:41<erkinalp>JAA: unless they have a WARC reader,
17:59:22<@JAA>z-oww-a
17:59:26CandidSparrow quits [Quit: Peace Out]
18:00:24CandidSparrow joins
18:00:49<imer>#nowa
18:01:21<imer>although I like the oww one better :D
18:08:49<@JAA>Or perhaps some play on the content. What are some sounds you'd absolutely not want to hear in an ASMR video?
18:09:00<nstrom|m>zowaah, zowie 🤷‍♂️
18:10:51<imer>#zo🍽️
18:14:28<@JAA>My terminal is sad about that last one.
18:17:13<imer>Yeah, lets not :)
18:32:26<@arkiver>:)
18:43:41Island joins
18:53:53<thuban>arkiver, re orange isp hosting: AntoninDelFabbro|m posted a link to a page listing sites, and i have been enumerating them using its api
18:55:46<thuban>if my suspicion that supplying 0 as the category id retrieves all categories is correct, i expect to be able to enumerate 159832 sites (some fraction of which will be duplicates or inaccessible due to various oddnesses)
18:57:20<thuban>i don't think it's realistic to do this 'manually', but maybe some `!a <` jobs? individual sites are quite small as a rule
19:03:36<AntoninDelFabbro|m>I'm really impressed and thankful
19:10:26<fireonlive>i for one vote for emoji channel ;)
19:10:43<fireonlive>:3
19:11:29<erkinalp>JAA: wowturkey definitively dead, we can update the deadwatch now (death date: 2023-08-27,0400Z)
19:11:58<erkinalp>s/deadwatch/deathwatch/
19:11:59<fireonlive>oh it is confirmed by owner?
19:12:25<erkinalp>owner not responding to any correspondence
19:12:32<erkinalp>no hope of coming back up again
19:12:53<erkinalp>the AB job has a few external links (~650 or so) pending
19:12:56<fireonlive>ah :(
19:14:11<erkinalp>to skip wowturkey.com without impacting the remaining ~650 external resources, i'd propose to temporarily map wowturkey.com to 0.0.0.0 ;[
19:14:15<@JAA>:-(
19:14:44<erkinalp>on the bot's end i mean
19:15:26<@JAA>The AB job is paused, and the offsite URLs aren't in danger, so we can let it sit until the true deadline just in case it comes back.
19:15:50<erkinalp>oh, i thought it was looping over and over
19:15:56<erkinalp>good that it's paused
19:16:49<erkinalp>if it doesn't come back up until 23 september, then it's DaaD
19:17:15<erkinalp>(23 september is when the hosting expires, exactly 22 years from the website's start)
19:17:35yts98 leaves
19:17:51yts98 joins
19:17:52<erkinalp>and the shutdown was exactly 20 years and 1 day from the first turkish language post
19:18:21<erkinalp>wowturkey initially consisted of english threads
19:18:27null joins
19:18:31<erkinalp>promoting turkey to outsiders
19:21:34<pokechu22>https://transfer.archivete.am/inline/eDzUk/monsite-orange.fr_seed_urls.txt - this is the smaller one of the two :|
19:22:00<pokechu22>(this also contains urls from monsite.orange.fr and monsite.wanadoo.fr, both of which give a page redirecting (but not a 3xx redirect) to monsite-orange.fr)
19:22:18rktk quits [Ping timeout: 265 seconds]
19:24:15<thuban>pokechu22: how was that list collected?
19:24:35<@JAA>erkinalp: Well, to be precise, 'paused' here just means a very slow request rate (one request every five minutes in this case), not actually paused.
19:24:44<@JAA>Also, not sure where you got that 650 number from.
19:25:02<pokechu22>Most of it was https://archive.org/developers/wayback-cdx-server.html (e.g. https://web.archive.org/cdx/search/cdx?url=pagesperso-orange.fr&matchType=domain&collapse=urlkey&fl=original&limit=100000&showResumeKey=1&resumeKey=fr%2Cpagesperso-orange%2Clignerolles-allier%29%2Fcartes_postales%2Fteillet%2520argenty%2Falbum%2Fslides%2Fle%2520tumulus.html+20141112051426) - I also
19:25:04<pokechu22>mixed in a list from #webroasting a while back
19:25:16<@JAA>There are about 8.3k offsite URLs in the remaining queue.
19:25:33<pokechu22>SrainUser's https://transfer.archivete.am/Y5Qsp/orange_isp_hosting_urls.txt which I think was scraped from the list the site gives but I'm not 100% sure
19:26:00<pokechu22>and, yes, there's a fair bit of garbage on my list - easier to let it be attempted and fail than to try to filter it out
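The CDX query pokechu22 links above can be built programmatically. A minimal sketch of constructing that query URL (the domain is the one from the log; fetching and paging with the returned resume key is left out):

```python
from urllib.parse import urlencode

def cdx_query_url(domain, limit=100000, resume_key=None):
    """Build a Wayback CDX server query like the one above: all captures
    under a domain, collapsed to unique URLs, returning only the URL column."""
    params = {
        "url": domain,
        "matchType": "domain",    # include all subdomains
        "collapse": "urlkey",     # one row per unique URL
        "fl": "original",         # only the original URL column
        "limit": limit,
        "showResumeKey": "1",     # emit a resume key for paging past the limit
    }
    if resume_key:
        params["resumeKey"] = resume_key
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```

Each response page ends with a resume key, which gets passed back in as `resume_key` to fetch the next chunk.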
19:26:15<@arkiver>thuban: if you have a list of sites, please do post them!
19:26:27<thuban>arkiver: still processing, will do
19:26:55<pokechu22>I can deduplicate my list against anything you find and start a second job for whatever's missing
19:28:10<pokechu22>It looks like there's a pagespro-orange.fr in addition to a pagesperso-orange.fr incidentally
19:28:46<thuban>yep
19:29:37<@arkiver>thuban: thank you
19:29:52<@arkiver>and in the meantime - all those queuing AB jobs for orange, please keep doing that
19:31:30<pokechu22>I'm currently doing an !a < list AB job for it - this is easier since it's one job for thousands of sites, but it's a bit buggy in that if the sites link to each other, it might not recurse properly. Still, it seems like the most practical way to do this
19:32:51<erkinalp>JAA: thanks for the number
19:33:24<@arkiver>erkinalp: we got a pretty serious chunk of it i believe
19:34:50<erkinalp>89% of items saved
19:35:06<erkinalp>maybe more
19:35:48<erkinalp>(the website had 9.35M posts, according to their own stats)
19:35:50<@arkiver>that's good!
19:36:06<fireonlive>:D
19:36:13<@arkiver>not sure the percentage is correct, but we got more than half i think
19:36:14<erkinalp>after scraping and reconstruction, i might actually get more posts
19:36:30<thuban>pokechu22: agreed re practicality (and i don't think sites linking to each other will be a problem--at worst, it won't get pages linked to by other sites but not their host site's homepage)
19:37:06<thuban>(and at best it will work fine, although there seems to be some confusion about this https://hackint.logs.kiska.pw/archiveteam-bs/20220725#c323962 ... https://hackint.logs.kiska.pw/archiveteam-bs/20220725#c323982)
19:38:46<thuban>i see that you have a job running for monsite-orange.fr_seed_urls.txt; are you going to start another for orange_isp_hosting_urls.txt? (or has that already been done?)
19:41:38AmAnd0A quits [Ping timeout: 265 seconds]
19:42:10AmAnd0A joins
19:43:10<pokechu22>I'm working on building my own list for the orange one based on orange_isp_hosting_urls.txt but I'm not going to run orange_isp_hosting_urls.txt directly
19:44:17<thuban>ok, cool
19:46:13<@JAA>erkinalp: Those stats are not right. You'd have to analyse the WARC data to tell how much we covered. The number of URLs retrieved is not really correlated to that in a meaningful way.
19:46:25<@JAA>We fetched offsite URLs, we fetched rating.php, and so on.
19:47:51<@JAA>A coarser estimate would be possible by analysing just the log file and retrieving how many topic IDs appear there, but later pages could be missing, so that's still only a rough estimate.
19:49:45<erkinalp>JAA: yeah that was what i was referring to by "after .. reconstruction, i might actually get more posts"
19:50:14<@JAA>Or less. We don't know how far it got through the forum pagination.
19:52:08<erkinalp>thankfully wowturkey's viewtopic.php page size is fixed at 10, and ttum.php page size is fixed at 100
19:53:51<erkinalp>and both were configured in a manner to link to the most recent pages of each respective topic
19:55:36Darken quits [Remote host closed the connection]
20:08:28jtagcat quits [Killed (ing.hackint.org (Nickname regained by services))]
20:08:30jtagcat (jtagcat) joins
20:09:48CandidSparrow quits [Client Quit]
20:09:48IDK_ quits [Quit: Ping timeout (120 seconds)]
20:09:53Shampoo2140 quits [Remote host closed the connection]
20:10:13IDK_ joins
20:10:56CandidSparrow joins
20:11:03Shampoo2140 joins
20:14:01dazld quits [Ping timeout: 265 seconds]
20:14:16that_lurker quits [Quit: Clowning around is not the same as fooling around...I am a clown, not a fool]
20:14:24that_lurker (that_lurker) joins
20:18:24qwertyasdfuiopghjkl quits [Quit: qwertyasdfuiopghjkl]
20:23:49dazld (dazld) joins
20:41:54miki_57 quits [Client Quit]
20:44:34szczot3k (szczot3k) joins
20:44:57<szczot3k>Hi, how can I help with the efforts? Is a v6-space-holder useful anyhow?
20:48:12<@rewby>IIRC there's at least one project currently active that uses a ton of v6 ips
20:48:20<@rewby>I forget which one, imer wrote the code for it
20:48:23<@rewby>Or rather
20:48:30<@rewby>imer wrote the deployment code that makes it do lots of v6
20:48:50<szczot3k>Well, I can technically use a whole /39, so ready to help out
20:49:10<imer>#deadcat but we're bandwidth limited there anyways (and they seem to rate limit per single ip or something like that)
20:49:12<DigitalDragons>#deadcat is the one
20:49:27<imer>target bandwidth limited that is
20:49:56<@rewby>Also, I could see if I can hook up my spare /32 to some workers at some point
20:50:39<@rewby>I'm just busy with targets
20:50:39<imer>szczot3k: here's the aforementioned code/script as well: https://gist.github.com/imerr/614e534218a6b93be1a40b088dee885a
20:50:51<DigitalDragons>i heard #sweet supports ipv6 too but I don't know about their ratelimiting
20:51:26<imer>there is none unless you go way too fast and then they will (it seems) manually block your ip
20:51:54<DigitalDragons>hah
20:51:56<flashfire42>Ok got back through scrollback and it seems the consensus is to ignore webs for the moment and focus on orange? cc arkiver
20:52:49<DigitalDragons>also, glad to hear about wikis-grab!
20:55:09<DigitalDragons>i have some wikibot #// extraction almost ready but unsure about filtering
21:00:10Exorcism quits [Client Quit]
21:04:31miki_57 joins
21:21:02nexusxe (nexusxe) joins
21:23:38erkinalp quits [Ping timeout: 265 seconds]
21:37:12Larsenv7 (Larsenv) joins
21:40:03Larsenv quits [Ping timeout: 265 seconds]
21:42:28Larsenv7 quits [Ping timeout: 265 seconds]
21:47:09Larsenv7 (Larsenv) joins
21:50:56Larsenv76 (Larsenv) joins
21:53:35Larsenv7 quits [Ping timeout: 265 seconds]
21:53:35Larsenv76 is now known as Larsenv7
21:54:30Larsenv7 quits [Read error: Connection reset by peer]
21:59:46Larsenv76 (Larsenv) joins
22:02:44Larsenv76 quits [Client Quit]
22:04:12Larsenv76 (Larsenv) joins
22:05:25Larsenv76 quits [Client Quit]
22:06:26dumbgoy_ quits [Read error: Connection reset by peer]
22:06:44Larsenv76 (Larsenv) joins
22:08:09dumbgoy joins
22:08:43dumbgoy quits [Read error: Connection reset by peer]
22:09:04Larsenv76 quits [Client Quit]
22:10:41Larsenv76 (Larsenv) joins
22:16:54Larsenv76 quits [Client Quit]
22:18:33<thuban>ok, orange.fr enumeration finished and spot-checks suggest that i got all the categories
22:18:38<thuban>processing the results now
22:19:26BlueMaxima joins
22:20:11<thuban>malformed urls won't break archivebot, right? there are a few fun ones in here, like `usftennis2.monsite-orange.fr/index.html#="'><h1>abcd</h1>${{7*7}}${7*7}%{7+7}[[7*7]]@(1+2)<%= 7*7 %>` and `monsite.orange.fr@la-canaliere`
22:21:01Larsenv76 (Larsenv) joins
22:21:14Larsenv76 quits [Client Quit]
22:22:27dumbgoy joins
22:22:38Larsenv76 (Larsenv) joins
22:23:40Larsenv76 quits [Client Quit]
22:23:40<pokechu22>Right
22:23:56<pokechu22>the first one would just be treated as usftennis2.monsite-orange.fr/index.html because of the #
22:24:24<pokechu22>the second one would be probably treated as trying to log in as user monsite.orange.fr on site http://la-canaliere which obviously won't work, but will fail in an acceptable way
22:24:35<h2ibot>FireonLive edited Deathwatch (+294, move wowTURKEY to dead (we should use that…): https://wiki.archiveteam.org/?diff=50602&oldid=50598
22:24:44<pokechu22>the main thing that breaks archivebot is FTP - there are a few other things that can cause problems but they aren't easy to control for
22:24:50Larsenv76 (Larsenv) joins
22:24:56Larsenv76 quits [Client Quit]
22:29:22Larsenv76 (Larsenv) joins
22:29:26Larsenv76 quits [Client Quit]
22:33:37<h2ibot>FireonLive edited Deathwatch (+2, fix url for 2028-Russia going to example.com…): https://wiki.archiveteam.org/?diff=50603&oldid=50602
22:33:41<fireonlive>(i was like example.com?!)
22:34:43<@JAA>That mistake is so common.
22:35:07<@JAA>I wish there was a way to make edits throw an error when a template isn't used correctly.
22:35:24<@JAA>Probably possible with an extension or something ridiculous like that.
22:37:27<pokechu22>You could probably use an editfilter
22:37:48<pokechu22>er, for that one, probably the right thing to do is make it generate a big red message of anger instead of silently using example.com
22:38:20<@JAA>We do have https://wiki.archiveteam.org/index.php/Category:Pages_with_broken_URLs for all uses of Template:URL where the URL is empty.
22:38:40<@JAA>I just remembered that I added that at one point.
22:40:40<pokechu22>https://en.wikipedia.org/wiki/Module:Check_for_unknown_parameters exists but I don't think lua is enabled on the AT wiki
22:41:02Larsenv76 (Larsenv) joins
22:42:38<h2ibot>Pokechu22 edited Template:Url (+138, add visible warning about broken URLs): https://wiki.archiveteam.org/?diff=50604&oldid=49244
22:42:39<h2ibot>Pokechu22 edited Reddit (-1, fix incorrect {{URL}} usage): https://wiki.archiveteam.org/?diff=50605&oldid=49987
22:43:38<h2ibot>Pokechu22 edited Talk:Twitter (+2, fix incorrect {{URL}} usage): https://wiki.archiveteam.org/?diff=50606&oldid=49771
22:44:38<h2ibot>Pokechu22 created Category:Pages with broken URLs (+210, Created page with "Pages that use…): https://wiki.archiveteam.org/?title=Category%3APages%20with%20broken%20URLs
22:45:32<@JAA>Good idea, thanks.
22:45:58lukash9 joins
22:46:37<@JAA>Should be good enough.
22:46:38h3ndr1k quits [Quit: ]
22:46:59h3ndr1k (h3ndr1k) joins
22:56:18<fireonlive>awesome ^_^
23:15:42railen63 quits [Remote host closed the connection]
23:19:12railen63 joins
23:22:02c joins
23:22:30c quits [Remote host closed the connection]
23:31:46qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
23:54:56imer4 (imer) joins
23:58:38imer quits [Ping timeout: 252 seconds]
23:59:12imer (imer) joins