| 00:27:29 | | jacksonchen666 is now authenticated as * |
| 00:27:29 | | jacksonchen666 is now known as RJHacker1205 |
| 00:27:33 | | jacksonchen666 (jacksonchen666) joins |
| 00:30:26 | | RJHacker1205 quits [Ping timeout: 245 seconds] |
| 00:48:54 | | Arcorann (Arcorann) joins |
| 01:15:57 | <pabs> | AntoninDelFabbro|m: which website, what are you trying to get? I tend to use different things for different purposes. for eg: googler/ddgr for site: search engine queries. curl/wget for downloads. pup for HTML parsing/querying. jq for JSON querying |
| 01:42:21 | | Arcorann quits [Ping timeout: 265 seconds] |
| 01:44:42 | | Arcorann (Arcorann) joins |
| 01:44:53 | | Megame quits [Ping timeout: 252 seconds] |
| 02:05:49 | | erkinalp joins |
| 02:06:57 | | erkinalp quits [Remote host closed the connection] |
| 02:19:39 | <h2ibot> | Systwi uploaded File:Duck Hunt (World)-0--twitter-5.png (Mr. Peepers holding the Twitter bird,…): https://wiki.archiveteam.org/?title=File%3ADuck%20Hunt%20%28World%29-0--twitter-5.png |
| 02:27:41 | <h2ibot> | Systwi edited Twitter (+177, /* Vital Signs */ Added meme and serious caption.): https://wiki.archiveteam.org/?diff=50593&oldid=50591 |
| 02:35:29 | <fireonlive> | systwi: 😃 |
| 02:39:43 | <h2ibot> | Systwi edited Site exploration (+398, /* Twitter */ Mentioned Nitter and Twitter's…): https://wiki.archiveteam.org/?diff=50594&oldid=50492 |
| 02:40:02 | <systwi> | fireonlive: :-D |
| 03:15:20 | | Megame (Megame) joins |
| 03:15:40 | | parfait (kdqep) joins |
| 03:26:14 | | mindstrut quits [Read error: Connection reset by peer] |
| 03:26:41 | | mindstrut joins |
| 03:51:49 | | krvme joins |
| 03:52:01 | | fluke quits [Ping timeout: 258 seconds] |
| 03:52:12 | | fluke joins |
| 03:55:45 | | Krume quits [Ping timeout: 265 seconds] |
| 04:17:31 | | DogsRNice quits [Read error: Connection reset by peer] |
| 04:20:53 | | dazld quits [Ping timeout: 265 seconds] |
| 04:44:41 | | erkinalp joins |
| 04:54:05 | <erkinalp> | wowturkey is down; not known whether it's temporary and will be restored, or permanent as in finally closed |
| 04:54:40 | | dazld (dazld) joins |
| 04:54:48 | <erkinalp> | let's leave the bot running in hope it returns once more |
| 04:55:02 | <erkinalp> | the announced date was august 31 |
| 05:01:28 | | IDK (IDK) joins |
| 05:02:15 | <pabs> | nicolas17: anything to save for this? https://www.volkerkrause.eu/2023/08/26/kde-jenkins-retirement-progress.html |
| 05:03:50 | <nicolas17> | pabs: I doubt it because most data there was ephemeral in the first place, eg. there's projects that do a daily build and only the last 5 binaries are kept |
| 05:04:02 | | BigBrain quits [Remote host closed the connection] |
| 05:04:04 | <pabs> | and what about the phabricator? |
| 05:04:23 | <erkinalp> | phabricator is a bugtracker, that's significant |
| 05:04:24 | | BigBrain (bigbrain) joins |
| 05:04:41 | <erkinalp> | tickets may have good things |
| 05:05:34 | <nicolas17> | we'll probably turn it into static pages somehow |
| 05:05:55 | <nicolas17> | I'm not sure how easy it is to archive, I think there's like, JS-backed "load more comments" stuff? |
| 05:05:57 | | pabs recommends an AB job, then download the static files :) |
| 05:06:30 | <pabs> | I did a phabricator recently, apart from the large amount of ignores I think it worked ok |
| 05:07:38 | <erkinalp> | we'd have to do the same with missing wowturkey viewtopic pages with p=### links |
| 05:08:06 | <erkinalp> | corresponding t=####&start=### ones have already been crawled |
| 05:08:27 | <nicolas17> | at one point we considered moving issues from phabricator to gitlab and it was messy because tickets can have multiple tags/projects that they belong to, while gitlab issues belong to *one* project |
| 05:09:15 | <nicolas17> | so we would need to check case by case and make a list of "if a ticket has tag X and tag Y, put it in repo Y" |
| 05:09:58 | <pabs> | AB job then static seems better |
| 05:10:18 | <nicolas17> | well yeah, this was *early* in the gitlab move when a lot of tickets would still be active |
| 05:10:25 | | BigBrain quits [Remote host closed the connection] |
| 05:11:38 | <nicolas17> | by now I guess a lot was closed, or stopped mattering, or was still active and someone moved it manually |
| 05:12:31 | <flashfire42> | Ok should I focus on webs or orange today? both have close cut off dates. Or do I say the hell with the both of them and continue with the aussie ISPs that have technically passed their shutdown date and are still up? |
| 05:14:39 | <erkinalp> | "September 1: wowTURKEY[IA•Wcite•.today•MemWeb], a large Turkish photo sharing forum[23]" september 1 → august 27 |
| 05:15:00 | <erkinalp> | don't kill the archivebot crawler tho |
| 05:15:12 | <erkinalp> | it still crawls previously failed external links |
| 05:17:18 | <erkinalp> | if we had started this crawl one day before, we would have the full archive today... |
| 05:17:44 | | dumbgoy_ quits [Ping timeout: 252 seconds] |
| 05:17:45 | <flashfire42> | Alas that is the joys of web archival |
| 05:17:50 | <flashfire42> | things are lost every day my friend |
| 05:18:01 | | hitgrr8 joins |
| 05:18:07 | <flashfire42> | and it sucks. it does. but we do what we can |
| 05:18:15 | <erkinalp> | 87% is better than nothing |
| 05:18:38 | <erkinalp> | (known item count is ~9.4M, we got 8,338,246) |
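The coverage figure quoted above can be checked with a quick calculation. This is only a sketch using the two numbers given in the chat (a known item count of roughly 9.4M and 8,338,246 items fetched); the variable names are illustrative.

```python
# Numbers taken from the chat; the names are illustrative only.
known_items = 9_400_000      # "~9.4M" known item count (approximate)
fetched_items = 8_338_246    # items retrieved so far

# Share of known items that were fetched, as a percentage with one decimal.
coverage_pct = round(fetched_items / known_items * 100, 1)
print(f"coverage: {coverage_pct}%")
```

With these inputs the ratio works out to about 88.7%, close to the rough figure mentioned in the chat.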
| 05:23:32 | <pabs> | https://www.science.org/content/article/government-seizure-nicaraguan-university-blow-science-researchers-say |
| 05:23:49 | <erkinalp> | 300ms is too short for wowturkey, the server's own delay was about 500ms even when it was up |
| 05:26:02 | | BigBrain (bigbrain) joins |
| 05:36:25 | <pokechu22> | flashfire42: I should probably do an !a < list job on orange - webs might be better to focus on. On the other hand webs has the stupid calendars that make things a mess :| |
| 05:37:10 | <flashfire42> | orange has about 4 different subdomains and some of them don't even resolve for me but do for others. Webs is not a set-and-forget thing, which is really what I am aiming for, because the calendars are so fucking broken |
| 05:38:46 | | sec^nd quits [Ping timeout: 245 seconds] |
| 05:39:20 | | sec^nd (second) joins |
| 06:04:48 | | dazld quits [Ping timeout: 265 seconds] |
| 06:06:42 | | dazld (dazld) joins |
| 06:15:01 | | sec^nd quits [Ping timeout: 245 seconds] |
| 06:28:06 | | sec^nd (second) joins |
| 06:43:31 | <pokechu22> | Can I get a list of those different subdomains? (Lists of individual sites would be useful too but I have some ideas of how to get those once I know the starting points) |
| 06:46:51 | <flashfire42> | pagesperso-orange.fr |
| 06:46:51 | <flashfire42> | monsite-orange.fr |
| 06:46:53 | <flashfire42> | those are the main 2 |
| 06:47:06 | <flashfire42> | images are hosted on cdn.woopic.com |
| 06:50:26 | | krvme quits [Read error: Connection reset by peer] |
| 06:56:56 | <DigitalDragons> | a few individual sites: https://crawlyproject.digitaldragon.dev/cds/lists/fr/pagesperso-orange/ |
| 07:04:08 | <DigitalDragons> | (ignore the .txt at the end of everything) |
| 07:09:23 | | Unholy2361316618085 quits [Ping timeout: 252 seconds] |
| 07:09:33 | <AntoninDelFabbro|m> | pabs: I want to download https://annuaire-pp.orange.fr/accueil, but thanks for your help! :D |
| 07:11:42 | <pabs> | a good option for that is open it in your browser, open dev tools, click on all the things on the site, then save all the requests as a .har and then AB all the URLs output by this shell oneliner: |
| 07:11:44 | <pabs> | for f in *.har ; do jq -r '.log.entries[].request.url' < "$f" ; done | sort -u |
| 07:12:12 | <pabs> | ah, better open dev tools before loading the page, woops |
| 07:12:49 | <pabs> | there are some browser based crawler things on the wiki somewhere too |
| 07:13:15 | <pabs> | but they may not work if you need to interact with the site |
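For anyone without jq at hand, the HAR extraction described above could also be sketched in Python. This assumes the usual HAR layout in which request URLs live under `log.entries[].request.url`, the same path the jq filter uses; the function names `har_urls` and `collect` are made up for this example.

```python
import json


def har_urls(har: dict) -> set[str]:
    """Return the set of request URLs recorded in one parsed HAR document."""
    entries = har.get("log", {}).get("entries", [])
    return {entry["request"]["url"] for entry in entries if "request" in entry}


def collect(paths) -> list[str]:
    """Read several .har files and return their unique URLs, sorted."""
    urls: set[str] = set()
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            urls |= har_urls(json.load(fh))
    return sorted(urls)
```

Calling `collect(["session.har"])` (the file name is a placeholder) should give the same deduplicated, sorted list as the shell loop piped through `sort -u`.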
| 07:14:27 | | Megame quits [Client Quit] |
| 07:20:28 | <AntoninDelFabbro|m> | Gold! I just woke up, but I'm impatient to try this asap! Thank you! |
| 07:26:26 | | Perk quits [Ping timeout: 252 seconds] |
| 07:27:16 | | Perk joins |
| 07:31:24 | <erkinalp> | qyxojzh|m: wowturkey definitively down |
| 07:42:23 | | nulldata quits [Ping timeout: 252 seconds] |
| 07:45:36 | | nulldata (nulldata) joins |
| 07:52:01 | | miki_57 joins |
| 08:03:50 | | nulldata quits [Ping timeout: 252 seconds] |
| 08:04:18 | <erkinalp> | JAA: wowturkey definitively down, as of 0400UTC today |
| 08:06:37 | | nulldata (nulldata) joins |
| 08:10:31 | <thuban> | AntoninDelFabbro|m: that's a good way to capture data you can click through manually, but if the amount of navigation required is very large, i personally prefer to write a short script. |
| 08:11:36 | <thuban> | i have done so for annuaire-pp.orange.fr (and in the process, i believe, discovered more results than are shown in the browser) and will dump results tomorrow |
| 08:16:16 | | MetaNova quits [Ping timeout: 265 seconds] |
| 08:20:56 | <AntoninDelFabbro|m> | Awesome! Haha, well you saved me a lot of time, thanks ;) |
| 08:21:56 | | MetaNova (MetaNova) joins |
| 08:22:57 | | Island quits [Read error: Connection reset by peer] |
| 08:26:09 | | miki_57 quits [Max SendQ exceeded] |
| 08:26:10 | <thuban> | AntoninDelFabbro|m: you're welcome! |
| 08:26:12 | | miki_57 joins |
| 08:32:08 | <thuban> | also, uh, can someone remind me what the status is on orange isp hosting in general? are we still just dumping stuff in archivebot? because there are tens of thousands of these |
| 08:34:19 | <thuban> | (i was going to add 'and some of them require javascript', but based on my spot-checking they're all in the weird 'put everything in the html, but don't actually display it until the js loads' idiom, so i think archivebot would actually be fine in that respect) |
| 08:36:01 | | aGerman quits [Quit: The Lounge - https://thelounge.chat] |
| 08:39:13 | | aGerman (aGerman) joins |
| 08:47:26 | <erkinalp> | qyxojzh|m: JAA: arkiver: one of wowturkey's former mods is about to ask the owner to buy and resurrect wowturkey.com |
| 08:47:37 | <erkinalp> | we might get a last chance revive |
| 08:58:11 | <flashfire42> | thuban: um, everything into archivebot unless you design some scripts, because it's about a week away from going bye bye and we have like 3 or 4 warrior projects on the go and fuck all ingestion to IA right now |
| 09:09:02 | | Exorcism (exorcism) joins |
| 09:26:09 | | erkinalp quits [Remote host closed the connection] |
| 09:28:11 | | erkinalp joins |
| 09:44:11 | <h2ibot> | Bzc6p edited Demotivalo.net (+37, /* Sister sites */ Update stati): https://wiki.archiveteam.org/?diff=50595&oldid=47826 |
| 10:00:01 | | railen63 quits [Remote host closed the connection] |
| 10:00:18 | | railen63 joins |
| 10:03:27 | | Exorcism quits [Client Quit] |
| 10:04:42 | | Exorcism (exorcism) joins |
| 10:18:01 | | jacksonchen666 is now authenticated as * |
| 10:18:01 | | jacksonchen666 is now known as RJHacker92085 |
| 10:18:05 | | jacksonchen666 (jacksonchen666) joins |
| 10:20:51 | | RJHacker92085 quits [Ping timeout: 245 seconds] |
| 10:31:58 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 10:46:05 | | systwi quits [Ping timeout: 252 seconds] |
| 10:52:33 | | systwi (systwi) joins |
| 11:25:14 | <erkinalp> | wowturkey is down as of now |
| 11:33:44 | | railen63 quits [Remote host closed the connection] |
| 11:36:59 | | railen63 joins |
| 12:04:36 | | BigBrain quits [Ping timeout: 245 seconds] |
| 12:06:52 | | BigBrain (bigbrain) joins |
| 12:19:27 | <pabs> | this site has 6TB of FLACs for bluegrass music: https://bluegrassarchive.com/ (frameset for https://gdarchive.net/Public/Bluegrass/contents.htm) |
| 12:20:21 | <pabs> | would be nice to grab eventually, but seems a bit big for AB, especially with the current IA upload limits |
| 12:32:43 | <h2ibot> | PaulWise edited Bugzilla (+1073, more from BZ site…): https://wiki.archiveteam.org/?diff=50596&oldid=50588 |
| 12:33:44 | <h2ibot> | PaulWise edited Bugzilla (-93, remove accidentally added done ones): https://wiki.archiveteam.org/?diff=50597&oldid=50596 |
| 12:43:11 | | hitgrr8 quits [Client Quit] |
| 12:43:45 | <h2ibot> | PaulWise edited Deathwatch (+295, Eclipse Wiki shutdown): https://wiki.archiveteam.org/?diff=50598&oldid=50584 |
| 12:53:48 | <h2ibot> | PaulWise edited Bugzilla (+0, Eclipse Bugzilla shutdown, AB in progress:…): https://wiki.archiveteam.org/?diff=50599&oldid=50597 |
| 13:03:41 | | systwi quits [Client Quit] |
| 13:05:41 | | systwi (systwi) joins |
| 13:09:22 | | jacksonchen666 quits [Client Quit] |
| 13:33:27 | | thenes quits [Quit: WeeChat 4.0.0] |
| 13:34:56 | <h2ibot> | Exorcism edited Orain (-6): https://wiki.archiveteam.org/?diff=50600&oldid=44412 |
| 13:46:06 | <@arkiver> | pabs: it's maybe fine to put in AB when the current problems at IA are fixed |
| 13:46:31 | <pabs> | ok, wasn't sure if AB could handle that volume either |
| 13:46:44 | <pabs> | thanks |
| 13:48:26 | <@arkiver> | well JAA is the expert on that |
| 13:50:15 | | dazld quits [Ping timeout: 265 seconds] |
| 13:55:00 | <h2ibot> | Exorcism edited Nupedia (+10): https://wiki.archiveteam.org/?diff=50601&oldid=28721 |
| 14:01:20 | | Arcorann quits [Ping timeout: 252 seconds] |
| 14:03:49 | | dazld (dazld) joins |
| 14:05:33 | <@JAA> | erkinalp: Ugh. Yeah, let's hope it's resurrected. |
| 14:07:43 | <@JAA> | pabs, arkiver: AB doesn't care much about data size as long as there aren't huge files in it. The other limiting factor is number of URLs, but until you go over 100M, that's not usually a problem either. |
| 14:23:02 | | ymgve joins |
| 14:28:10 | <erkinalp> | JAA: seems no hope |
| 14:50:13 | | mgrytbak quits [Quit: Ping timeout (120 seconds)] |
| 14:50:28 | | mgrytbak joins |
| 14:51:51 | <erkinalp> | why are archivebot downloads so slow? |
| 14:52:21 | <@JAA> | Do you mean downloads of ArchiveBot data from the Internet Archive? |
| 15:15:17 | <erkinalp> | no, archivebot data downloads from archive.fart.website |
| 15:15:40 | <erkinalp> | i'm getting isdn speeds of download currently |
| 15:16:02 | <erkinalp> | it can't be due to my link speeds either |
| 15:16:15 | <erkinalp> | i happen to have 70mbps down, 10mbps up |
| 15:16:30 | <@JAA> | The AB viewer is just an index of the data on IA. |
| 15:16:38 | <@JAA> | The links go to IA. |
| 15:17:04 | <@JAA> | And yeah, downloads from IA are notoriously slow, especially if you aren't near the Bay Area. |
| 15:18:08 | <erkinalp> | it isn't that slow normally |
| 15:18:21 | <erkinalp> | it was usually a few mbp |
| 15:18:24 | <erkinalp> | s |
| 15:18:40 | <erkinalp> | i could get good dsl speeds of download |
| 15:18:47 | <erkinalp> | not dialup or isdn speeds |
| 15:19:17 | | mgrytbak quits [Client Quit] |
| 15:19:27 | | mgrytbak joins |
| 15:21:04 | <@JAA> | It varies depending on IA load and which server the data is on. |
| 15:23:03 | | dazld quits [Ping timeout: 265 seconds] |
| 15:25:08 | | DogsRNice joins |
| 15:31:15 | | Dango360 quits [Read error: Connection reset by peer] |
| 15:34:03 | | Dango360 (Dango360) joins |
| 15:37:36 | | dazld (dazld) joins |
| 15:44:52 | | mgrytbak quits [Client Quit] |
| 15:45:01 | | mgrytbak joins |
| 15:49:02 | | mgrytbak quits [Client Quit] |
| 15:49:12 | | mgrytbak joins |
| 16:03:15 | | petrichor quits [Client Quit] |
| 16:04:15 | | petrichor (petrichor) joins |
| 16:25:24 | | lflare quits [Quit: Bye] |
| 16:43:27 | | dumbgoy_ joins |
| 17:24:38 | | aninternettroll_ (aninternettroll) joins |
| 17:24:46 | | aninternettroll quits [Read error: Connection reset by peer] |
| 17:25:02 | | aninternettroll_ is now known as aninternettroll |
| 17:27:41 | <erkinalp> | IA downloads are now down to dialup speeds |
| 17:30:31 | <@JAA> | Not surprising. IA is pretty busy recently, and it slows various things to a crawl. |
| 17:31:32 | <@JAA> | Even IA-internal things are slow. One particular item I was monitoring took over 6 days to move 43 GB around internally. |
| 17:31:43 | <fireonlive> | ooof |
| 17:32:03 | <@JAA> | (Move it from S3 to item server, checksums, and mirror to backup server.) |
| 17:40:57 | | erkinalp quits [Remote host closed the connection] |
| 17:41:48 | | erkinalp joins |
| 17:41:59 | <@arkiver> | i'm planning on dusting off wikis-grab for the upcoming deletions of wikis |
| 17:42:09 | <@arkiver> | though those are also largely covered already by wikiteam dumps i believe |
| 17:42:21 | <@arkiver> | the wikis-grab would more be a general method of archiving wikis |
| 17:43:00 | <@arkiver> | project is also coming for ZOWA |
| 17:44:36 | <@arkiver> | any idea for a channel for zowa.app ? |
| 17:45:05 | <@arkiver> | flashfire42: do you know if we have the orange ISP hosting stuff fully covered with AB? |
| 17:48:43 | <pokechu22> | arkiver: have you seen #wikibot? |
| 17:48:58 | <pokechu22> | ah, you're already in that channel |
| 17:49:11 | | Darken (Darken) joins |
| 17:49:25 | <@arkiver> | pokechu22: yeah |
| 17:49:42 | <@arkiver> | i think it's good to have both dumps from there and from a project creating WARCs |
| 17:49:45 | <pokechu22> | I haven't seen wikis-grab before - does it try to do everything wikiteam does, or is it mostly focused on saving the current revision of every page? |
| 17:49:53 | <pokechu22> | Yeah, WARCs are good |
| 17:50:03 | <@arkiver> | current revision |
| 17:50:21 | <@arkiver> | may be good to allow wikiteam higher priority with the dumps than WARCs, since it's more complete |
| 17:50:36 | <@arkiver> | but after that we should attempt to create WARCs as well |
| 17:51:03 | <pokechu22> | Saving the current revision and maybe all pages on the history tab (but not the revisions themselves - just the history list for attribution) probably is enough for WARCs |
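As a rough illustration of the "current revision plus history list" idea, the two URLs per page could be built like this. It is not taken from wikis-grab; it only assumes a standard MediaWiki `index.php` entry point with the `title` and `action=history` parameters, and the base URL and function name are placeholders.

```python
from urllib.parse import urlencode


def wiki_urls(index_php: str, title: str) -> list[str]:
    """Build the current-revision and history-list URLs for one page title."""
    current = f"{index_php}?{urlencode({'title': title})}"
    history = f"{index_php}?{urlencode({'title': title, 'action': 'history'})}"
    return [current, history]
```

For example, `wiki_urls("https://wiki.example/index.php", "Main Page")` yields the page URL and its history URL. Long histories are paginated, so a real crawl would also need to follow the "older" links, which this sketch does not attempt.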
| 17:51:36 | <@arkiver> | yeah |
| 17:51:52 | <@JAA> | Love it! |
| 17:51:55 | <@arkiver> | it's largely for URL preservation, so it's in the wayback machine and easily browsable |
| 17:52:12 | <@arkiver> | the dumps can be used to restore a wiki with (right?), but for browsing WARCs are better |
| 17:52:14 | <@arkiver> | JAA: :) |
| 17:53:28 | <@JAA> | Yep, that's accurate. |
| 17:53:44 | <@JAA> | The dumps are entirely unusable for the average person. |
| 17:54:10 | <@arkiver> | and of course outlinks for #// :) |
| 17:55:16 | <nstrom|m> | <arkiver> "any idea for a channel for zowa..." <- zowch |
| 17:55:57 | <flashfire42|m> | Sorry arkiver, it’s 4am here and you are lucky I snap awake for random reasons. Orange is far from complete in archivebot. I’ve been launching as many jobs as I can manage but it’s like fighting a fire with a kids water bucket. I’ll get a sampling but not all of it |
| 17:56:50 | <flashfire42|m> | I will continue to throw in as much as possible during the next week but we won’t get it all, I can say that with certainty. Not unless we get a stay of execution for another month or 2 |
| 17:57:18 | <flashfire42|m> | Hopefully that info is helpful it is time for me to head back to sleep for another 2 hours. |
| 17:58:22 | <pokechu22> | I'll try to do an !a < list job for it too |
| 17:58:32 | <pokechu22> | The deadline for webs is sooner though |
| 17:58:41 | <erkinalp> | JAA: unless they have a WARC reader, |
| 17:59:22 | <@JAA> | z-oww-a |
| 17:59:26 | | CandidSparrow quits [Quit: Peace Out] |
| 18:00:24 | | CandidSparrow joins |
| 18:00:49 | <imer> | #nowa |
| 18:01:21 | <imer> | although I like the oww one better :D |
| 18:08:49 | <@JAA> | Or perhaps some play on the content. What are some sounds you'd absolutely not want to hear in an ASMR video? |
| 18:09:00 | <nstrom|m> | zowaah, zowie 🤷♂️ |
| 18:10:51 | <imer> | #zo🍽️ |
| 18:14:28 | <@JAA> | My terminal is sad about that last one. |
| 18:17:13 | <imer> | Yeah, lets not :) |
| 18:32:26 | <@arkiver> | :) |
| 18:43:41 | | Island joins |
| 18:53:53 | <thuban> | arkiver, re orange isp hosting: AntoninDelFabbro|m posted a link to a page listing sites, and i have been enumerating them using its api |
| 18:55:46 | <thuban> | if my suspicion that supplying 0 as the category id retrieves all categories is correct, i expect to be able to enumerate 159832 sites (some fraction of which will be duplicates or inaccessible due to various oddnesses) |
| 18:57:20 | <thuban> | i don't think it's realistic to do this 'manually', but maybe some `!a <` jobs? individual sites are quite small as a rule |
| 19:03:36 | <AntoninDelFabbro|m> | I'm really impressed and thankful |
| 19:10:26 | <fireonlive> | i for one vote for emoji channel ;) |
| 19:10:43 | <fireonlive> | :3 |
| 19:11:29 | <erkinalp> | JAA: wowturkey definitively dead, we can update the deadwatch now (death date: 2023-08-27,0400Z) |
| 19:11:58 | <erkinalp> | s/deadwatch/deathwatch/ |
| 19:11:59 | <fireonlive> | oh it is confirmed by owner? |
| 19:12:25 | <erkinalp> | owner not responding to any correspondence |
| 19:12:32 | <erkinalp> | no hope of coming back up again |
| 19:12:53 | <erkinalp> | the AB job has a few external links (~650 or so) pending |
| 19:12:56 | <fireonlive> | ah :( |
| 19:14:11 | <erkinalp> | to skip wowturkey.com without impacting the remaining ~650 external resources, i'd propose to temporarily map wowturkey.com to 0.0.0.0 ;[ |
| 19:14:15 | <@JAA> | :-( |
| 19:14:44 | <erkinalp> | in the bot's end i mean |
| 19:15:26 | <@JAA> | The AB job is paused, and the offsite URLs aren't in danger, so we can let it sit until the true deadline just in case it comes back. |
| 19:15:50 | <erkinalp> | oh, i thought it was looping over and over |
| 19:15:56 | <erkinalp> | good that it's paused |
| 19:16:49 | <erkinalp> | if it doesn't come back up until 23 september, then it's DaaD |
| 19:17:15 | <erkinalp> | (23 september is when the hosting expires, exactly 22 years from the website's start) |
| 19:17:35 | | yts98 leaves |
| 19:17:51 | | yts98 joins |
| 19:17:52 | <erkinalp> | and the shutdown was exactly 20 years and 1 day from the first turkish language post |
| 19:18:21 | <erkinalp> | wowturkey initially consisted of english threads |
| 19:18:27 | | null joins |
| 19:18:31 | <erkinalp> | promoting turkey to outsiders |
| 19:21:34 | <pokechu22> | https://transfer.archivete.am/inline/eDzUk/monsite-orange.fr_seed_urls.txt - this is the smaller one of the two :| |
| 19:22:00 | <pokechu22> | (this also contains urls from monsite.orange.fr and monsite.wanadoo.fr, both of which give a page redirecting (but not a 3xx redirect) to monsite-orange.fr) |
| 19:22:18 | | rktk quits [Ping timeout: 265 seconds] |
| 19:24:15 | <thuban> | pokechu22: how was that list collected? |
| 19:24:35 | <@JAA> | erkinalp: Well, to be precise, 'paused' here just means a very slow request rate (one request every five minutes in this case), not actually paused. |
| 19:24:44 | <@JAA> | Also, not sure where you got that 650 number from. |
| 19:25:02 | <pokechu22> | Most of it was https://archive.org/developers/wayback-cdx-server.html (e.g. https://web.archive.org/cdx/search/cdx?url=pagesperso-orange.fr&matchType=domain&collapse=urlkey&fl=original&limit=100000&showResumeKey=1&resumeKey=fr%2Cpagesperso-orange%2Clignerolles-allier%29%2Fcartes_postales%2Fteillet%2520argenty%2Falbum%2Fslides%2Fle%2520tumulus.html+20141112051426) - I also |
| 19:25:04 | <pokechu22> | mixed in a list from #webroasting a while back |
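The CDX query in the message above can be paged with the resume key. The sketch below only builds the request URL and splits a response body; it does no network I/O. The parameters mirror the URL quoted in the chat. Two things are assumptions: that the response ends with a blank line followed by the resume key when more results remain, and that the key comes back already percent-encoded (so it is appended verbatim). The function names are made up.

```python
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_url(domain: str, resume_key: Optional[str] = None, limit: int = 100000) -> str:
    """Build a CDX query listing original URLs for a domain and its subdomains."""
    params = {
        "url": domain,
        "matchType": "domain",
        "collapse": "urlkey",
        "fl": "original",
        "limit": limit,
        "showResumeKey": 1,
    }
    url = f"{CDX_ENDPOINT}?{urlencode(params)}"
    if resume_key:
        # Assumption: the key is already percent-encoded, so append it as-is.
        url += f"&resumeKey={resume_key}"
    return url


def split_page(body: str) -> tuple[list[str], Optional[str]]:
    """Split one response body into (urls, resume_key).

    Assumed layout: result lines, then a blank line, then the resume key.
    When no blank line precedes the last line, there is no key.
    """
    lines = body.rstrip("\n").split("\n")
    if len(lines) >= 2 and lines[-2] == "":
        return [line for line in lines[:-2] if line], lines[-1]
    return [line for line in lines if line], None
```

A caller would fetch `cdx_url(domain)`, pass the body to `split_page`, and repeat with the returned key until it is `None`.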
| 19:25:16 | <@JAA> | There are about 8.3k offsite URLs in the remaining queue. |
| 19:25:33 | <pokechu22> | SrainUser's https://transfer.archivete.am/Y5Qsp/orange_isp_hosting_urls.txt which I think was scraped from the list the site gives but I'm not 100% sure |
| 19:26:00 | <pokechu22> | and, yes, there's a fair bit of garbage on my list - easier to let it be attempted and fail than to try to filter it out |
| 19:26:15 | <@arkiver> | thuban: if you have a list of sites, please do post them! |
| 19:26:27 | <thuban> | arkiver: still processing, will do |
| 19:26:55 | <pokechu22> | I can deduplicate my list against anything you find and start a second job for whatever's missing |
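Deduplicating one seed list against another, as offered above, might look like the following. The normalization (ignore scheme, host case, and a trailing slash) is a guess at what counts as "the same site" here, not something stated in the chat, and both function names are hypothetical.

```python
def normalize(url: str) -> str:
    """Comparison key: drop the scheme, lowercase the host, strip trailing slashes."""
    key = url.strip()
    for prefix in ("https://", "http://"):
        if key.lower().startswith(prefix):
            key = key[len(prefix):]
            break
    host, sep, path = key.partition("/")
    return (host.lower() + sep + path).rstrip("/")


def missing_from(mine: list[str], theirs: list[str]) -> list[str]:
    """Return entries of `theirs` not already covered by `mine`, in order, without repeats."""
    seen = {normalize(u) for u in mine}
    out = []
    for url in theirs:
        key = normalize(url)
        if key and key not in seen:
            seen.add(key)
            out.append(url.strip())
    return out
```

The result of `missing_from(my_list, new_list)` is what would go into a second job.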
| 19:28:10 | <pokechu22> | It looks like there's a pagespro-orange.fr in addition to a pagesperso-orange.fr incidentally |
| 19:28:46 | <thuban> | yep |
| 19:29:37 | <@arkiver> | thuban: thank you |
| 19:29:52 | <@arkiver> | and in the meantime - all those queuing AB jobs for orange, please keep doing that |
| 19:31:30 | <pokechu22> | I'm currently doing an !a < list AB job for it - this is easier since it's one job for thousands of sites, but it's a bit buggy in that if the sites link to each other, it might not recurse properly. Still, it seems like the most practical way to do this |
| 19:32:51 | <erkinalp> | JAA: thanks for the number |
| 19:33:24 | <@arkiver> | erkinalp: we got a pretty serious chunk of it i believe |
| 19:34:50 | <erkinalp> | 89% of items saved |
| 19:35:06 | <erkinalp> | maybe more |
| 19:35:48 | <erkinalp> | (the website had 9.35M posts, according to their own stats) |
| 19:35:50 | <@arkiver> | that's good! |
| 19:36:06 | <fireonlive> | :D |
| 19:36:13 | <@arkiver> | not sure the percentage is correct, but we got more than half i think |
| 19:36:14 | <erkinalp> | after scraping and reconstruction, i might actually get more posts |
| 19:36:30 | <thuban> | pokechu22: agreed re practicality (and i don't think sites linking to each other will be a problem--at worst, it won't get pages linked to by other sites but not their host site's homepage) |
| 19:37:06 | <thuban> | (and at best it will work fine, although there seems to be some confusion about this https://hackint.logs.kiska.pw/archiveteam-bs/20220725#c323962 ... https://hackint.logs.kiska.pw/archiveteam-bs/20220725#c323982) |
| 19:38:46 | <thuban> | i see that you have a job running for monsite-orange.fr_seed_urls.txt; are you going to start another for orange_isp_hosting_urls.txt? (or has that already been done?) |
| 19:41:38 | | AmAnd0A quits [Ping timeout: 265 seconds] |
| 19:42:10 | | AmAnd0A joins |
| 19:43:10 | <pokechu22> | I'm working on building my own list for the orange one based on orange_isp_hosting_urls.txt but I'm not going to run orange_isp_hosting_urls.txt directly |
| 19:44:17 | <thuban> | ok, cool |
| 19:46:13 | <@JAA> | erkinalp: Those stats are not right. You'd have to analyse the WARC data to tell how much we covered. The number of URLs retrieved is not really correlated to that in a meaningful way. |
| 19:46:25 | <@JAA> | We fetched offsite URLs, we fetched rating.php, and so on. |
| 19:47:51 | <@JAA> | A coarser estimate would be possible by analysing just the log file and retrieving how many topic IDs appear there, but later pages could be missing, so that's still only a rough estimate. |
| 19:49:45 | <erkinalp> | JAA: yeah that was what i was referring to by "after .. reconstruction, i might actually get more posts" |
| 19:50:14 | <@JAA> | Or less. We don't know how far it got through the forum pagination. |
| 19:52:08 | <erkinalp> | thankfully wowturkey's viewtopic.php page size is fixed at 10, and ttum.php page size is fixed at 100 |
| 19:53:51 | <erkinalp> | and both were configured in a manner to link to the most recent pages of each respective topic |
| 19:55:36 | | Darken quits [Remote host closed the connection] |
| 20:08:28 | | jtagcat is now authenticated as * |
| 20:08:28 | | jtagcat quits [Killed (ing.hackint.org (Nickname regained by services))] |
| 20:08:30 | | jtagcat (jtagcat) joins |
| 20:09:48 | | CandidSparrow quits [Client Quit] |
| 20:09:48 | | IDK_ quits [Quit: Ping timeout (120 seconds)] |
| 20:09:53 | | Shampoo2140 quits [Remote host closed the connection] |
| 20:10:13 | | IDK_ joins |
| 20:10:56 | | CandidSparrow joins |
| 20:11:03 | | Shampoo2140 joins |
| 20:14:01 | | dazld quits [Ping timeout: 265 seconds] |
| 20:14:16 | | that_lurker quits [Quit: Clowning around is not the same as fooling around...I am a clown, not a fool] |
| 20:14:24 | | that_lurker (that_lurker) joins |
| 20:18:24 | | qwertyasdfuiopghjkl quits [Quit: qwertyasdfuiopghjkl] |
| 20:23:49 | | dazld (dazld) joins |
| 20:41:54 | | miki_57 quits [Client Quit] |
| 20:44:34 | | szczot3k (szczot3k) joins |
| 20:44:57 | <szczot3k> | Hi, how can I help with the efforts? Is a v6-space-holder useful anyhow? |
| 20:48:12 | <@rewby> | IIRC there's at least one project currently active that uses a ton of v6 ips |
| 20:48:20 | <@rewby> | I forget which one, imer wrote the code for it |
| 20:48:23 | <@rewby> | Or rather |
| 20:48:30 | <@rewby> | imer wrote the deployment code that makes it do lots of v6 |
| 20:48:50 | <szczot3k> | Well, I can technically use a whole /39, so ready to help out |
| 20:49:10 | <imer> | #deadcat but we're bandwidth limited there anyways (and they seem to rate limit per single ip or something like that) |
| 20:49:12 | <DigitalDragons> | #deadcat is the one |
| 20:49:27 | <imer> | target bandwidth limited that is |
| 20:49:56 | <@rewby> | Also, I could see if I can hook up my spare /32 to some workers at some point |
| 20:50:39 | <@rewby> | I'm just busy with targets |
| 20:50:39 | <imer> | szczot3k: here's the aforementioned code/script as well: https://gist.github.com/imerr/614e534218a6b93be1a40b088dee885a |
| 20:50:51 | <DigitalDragons> | i heard #sweet supports ipv6 too but I don't know about their ratelimiting |
| 20:51:26 | <imer> | there is none unless you go way too fast and then they will (it seems) manually block your ip |
| 20:51:54 | <DigitalDragons> | hah |
| 20:51:56 | <flashfire42> | Ok got back through scrollback and it seems the consensus is to ignore webs for the moment and focus on orange? cc arkiver |
| 20:52:49 | <DigitalDragons> | also, glad to hear about wikis-grab! |
| 20:55:09 | <DigitalDragons> | i have some wikibot #// extraction almost ready but unsure about filtering |
| 21:00:10 | | Exorcism quits [Client Quit] |
| 21:04:31 | | miki_57 joins |
| 21:05:03 | | miki_57 quits [Max SendQ exceeded] |
| 21:05:05 | | miki_57 joins |
| 21:21:02 | | nexusxe (nexusxe) joins |
| 21:23:38 | | erkinalp quits [Ping timeout: 265 seconds] |
| 21:37:12 | | Larsenv7 (Larsenv) joins |
| 21:40:03 | | Larsenv quits [Ping timeout: 265 seconds] |
| 21:42:28 | | Larsenv7 quits [Ping timeout: 265 seconds] |
| 21:47:09 | | Larsenv7 (Larsenv) joins |
| 21:50:56 | | Larsenv76 (Larsenv) joins |
| 21:53:35 | | Larsenv7 quits [Ping timeout: 265 seconds] |
| 21:53:35 | | Larsenv76 is now known as Larsenv7 |
| 21:54:30 | | Larsenv7 quits [Read error: Connection reset by peer] |
| 21:59:46 | | Larsenv76 (Larsenv) joins |
| 22:02:44 | | Larsenv76 quits [Client Quit] |
| 22:04:12 | | Larsenv76 (Larsenv) joins |
| 22:05:25 | | Larsenv76 quits [Client Quit] |
| 22:06:26 | | dumbgoy_ quits [Read error: Connection reset by peer] |
| 22:06:44 | | Larsenv76 (Larsenv) joins |
| 22:08:09 | | dumbgoy joins |
| 22:08:43 | | dumbgoy quits [Read error: Connection reset by peer] |
| 22:09:04 | | Larsenv76 quits [Client Quit] |
| 22:10:41 | | Larsenv76 (Larsenv) joins |
| 22:16:54 | | Larsenv76 quits [Client Quit] |
| 22:18:33 | <thuban> | ok, orange.fr enumeration finished and spot-checks suggest that i got all the categories |
| 22:18:38 | <thuban> | processing the results now |
| 22:19:26 | | BlueMaxima joins |
| 22:20:11 | <thuban> | malformed urls won't break archivebot, right? there are a few fun ones in here, like `usftennis2.monsite-orange.fr/index.html#="'><h1>abcd</h1>${{7*7}}${7*7}%{7+7}[[7*7]]@(1+2)<%= 7*7 %>` and `monsite.orange.fr@la-canaliere` |
| 22:21:01 | | Larsenv76 (Larsenv) joins |
| 22:21:14 | | Larsenv76 quits [Client Quit] |
| 22:22:27 | | dumbgoy joins |
| 22:22:38 | | Larsenv76 (Larsenv) joins |
| 22:23:40 | | Larsenv76 quits [Client Quit] |
| 22:23:40 | <pokechu22> | Right |
| 22:23:56 | <pokechu22> | the first one would just be treated as usftennis2.monsite-orange.fr/index.html because of the # |
| 22:24:24 | <pokechu22> | the second one would probably be treated as trying to log in as user monsite.orange.fr on site http://la-canaliere which obviously won't work, but will fail in an acceptable way |
| 22:24:35 | <h2ibot> | FireonLive edited Deathwatch (+294, move wowTURKEY to dead (we should use that…): https://wiki.archiveteam.org/?diff=50602&oldid=50598 |
| 22:24:44 | <pokechu22> | the main thing that breaks archivebot is FTP - there are a few other things that can cause problems but they aren't easy to control for |
| 22:24:50 | | Larsenv76 (Larsenv) joins |
| 22:24:56 | | Larsenv76 quits [Client Quit] |
| 22:29:22 | | Larsenv76 (Larsenv) joins |
| 22:29:26 | | Larsenv76 quits [Client Quit] |
| 22:33:37 | <h2ibot> | FireonLive edited Deathwatch (+2, fix url for 2028-Russia going to example.com…): https://wiki.archiveteam.org/?diff=50603&oldid=50602 |
| 22:33:41 | <fireonlive> | (i was like example.com?!) |
| 22:34:43 | <@JAA> | That mistake is so common. |
| 22:35:07 | <@JAA> | I wish there was a way to make edits throw an error when a template isn't used correctly. |
| 22:35:24 | <@JAA> | Probably possible with an extension or something ridiculous like that. |
| 22:37:27 | <pokechu22> | You could probably use an editfilter |
| 22:37:48 | <pokechu22> | er, for that one, probably the right thing to do is make it generate a big red message of anger instead of silently using example.com |
| 22:38:20 | <@JAA> | We do have https://wiki.archiveteam.org/index.php/Category:Pages_with_broken_URLs for all uses of Template:URL where the URL is empty. |
| 22:38:40 | <@JAA> | I just remembered that I added that at one point. |
| 22:40:40 | <pokechu22> | https://en.wikipedia.org/wiki/Module:Check_for_unknown_parameters exists but I don't think lua is enabled on the AT wiki |
| 22:41:02 | | Larsenv76 (Larsenv) joins |
| 22:42:38 | <h2ibot> | Pokechu22 edited Template:Url (+138, add visible warning about broken URLs): https://wiki.archiveteam.org/?diff=50604&oldid=49244 |
| 22:42:39 | <h2ibot> | Pokechu22 edited Reddit (-1, fix incorrect {{URL}} usage): https://wiki.archiveteam.org/?diff=50605&oldid=49987 |
| 22:43:38 | <h2ibot> | Pokechu22 edited Talk:Twitter (+2, fix incorrect {{URL}} usage): https://wiki.archiveteam.org/?diff=50606&oldid=49771 |
| 22:44:38 | <h2ibot> | Pokechu22 created Category:Pages with broken URLs (+210, Created page with "Pages that use…): https://wiki.archiveteam.org/?title=Category%3APages%20with%20broken%20URLs |
| 22:45:32 | <@JAA> | Good idea, thanks. |
| 22:45:58 | | lukash9 joins |
| 22:46:37 | <@JAA> | Should be good enough. |
| 22:46:38 | | h3ndr1k quits [Quit: ] |
| 22:46:59 | | h3ndr1k (h3ndr1k) joins |
| 22:56:18 | <fireonlive> | awesome ^_^ |
| 23:15:42 | | railen63 quits [Remote host closed the connection] |
| 23:19:12 | | railen63 joins |
| 23:22:02 | | c joins |
| 23:22:30 | | c quits [Remote host closed the connection] |
| 23:31:46 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
| 23:54:56 | | imer4 (imer) joins |
| 23:58:38 | | imer quits [Ping timeout: 252 seconds] |
| 23:59:12 | | imer (imer) joins |
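
pokechu22's reading of the two malformed entries (22:23:56 and 22:24:24) can be reproduced with a generic URL splitter. This is only a rough sketch using Python's standard `urllib.parse`, not ArchiveBot's own parsing code; prepending `http://` to the scheme-less list entries is an assumption, and `split_with_scheme` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# The two entries quoted at 22:20:11, copied verbatim from the log.
FIRST = """usftennis2.monsite-orange.fr/index.html#="'><h1>abcd</h1>${{7*7}}${7*7}%{7+7}[[7*7]]@(1+2)<%= 7*7 %>"""
SECOND = "monsite.orange.fr@la-canaliere"


def split_with_scheme(entry: str):
    """Split a scheme-less list entry.

    Assumption: the entry is fetched over plain HTTP, so "http://" is
    prepended before splitting. Hypothetical helper, not ArchiveBot code.
    """
    return urlsplit("http://" + entry)


first = split_with_scheme(FIRST)
second = split_with_scheme(SECOND)

# Everything after the first '#' is the fragment, which is not sent to the
# server, so only the host and path matter for the request.
print("first :", first.hostname, first.path)

# Text before '@' in the authority is userinfo; the host is what follows.
print("second:", second.username, "@", second.hostname)
```

Under these assumptions the first entry reduces to host `usftennis2.monsite-orange.fr` with path `/index.html`, and the second is read as user `monsite.orange.fr` on host `la-canaliere`, matching the explanation given in the channel.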