00:21:15BornOn420 quits [Ping timeout: 272 seconds]
00:22:42BornOn420 (BornOn420) joins
00:33:17BornOn420 quits [Ping timeout: 272 seconds]
00:38:12BornOn420 (BornOn420) joins
00:52:17BornOn420 quits [Ping timeout: 272 seconds]
00:54:37benjinsmi joins
00:54:43DogsRNice quits [Remote host closed the connection]
00:54:43benjinsm quits [Remote host closed the connection]
00:54:43mindstrut1 quits [Remote host closed the connection]
00:54:43abirkill quits [Client Quit]
00:54:46Island quits [Remote host closed the connection]
00:54:48DogsRNice joins
00:54:56Island joins
00:55:03abirkill (abirkill) joins
00:55:39mindstrut joins
00:57:36Island_ joins
00:57:37TheTechRobo quits [Client Quit]
00:57:38Island quits [Remote host closed the connection]
00:58:05TheTechRobo8 (TheTechRobo) joins
01:00:55Island__ joins
01:01:14TheTechRobo8 quits [Excess Flood]
01:01:14Island_ quits [Remote host closed the connection]
01:01:14DogsRNice quits [Remote host closed the connection]
01:01:14DogsRNice joins
01:01:50TheTechRobo (TheTechRobo) joins
01:03:07BornOn420 (BornOn420) joins
01:04:33TheTechRobo quits [Excess Flood]
01:05:12TheTechRobo (TheTechRobo) joins
01:21:25BornOn420 quits [Ping timeout: 272 seconds]
01:31:46etnguyen03 (etnguyen03) joins
01:33:01BornOn420 (BornOn420) joins
01:39:09BornOn420 quits [Ping timeout: 272 seconds]
01:43:54BornOn420 (BornOn420) joins
01:48:47icedice quits [Client Quit]
01:50:33BornOn420 quits [Ping timeout: 272 seconds]
01:53:43tzt quits [Ping timeout: 272 seconds]
01:56:25tzt (tzt) joins
01:59:25etnguyen03 quits [Ping timeout: 272 seconds]
02:01:28BornOn420 (BornOn420) joins
02:12:43BornOn420 quits [Ping timeout: 272 seconds]
02:14:56BornOn420 (BornOn420) joins
02:21:55Wohlstand (Wohlstand) joins
02:26:01BornOn420 quits [Ping timeout: 272 seconds]
02:37:10BornOn420 (BornOn420) joins
02:37:33etnguyen03 (etnguyen03) joins
02:51:59BornOn420 quits [Ping timeout: 272 seconds]
02:55:46BornOn420 (BornOn420) joins
03:02:45BornOn420 quits [Ping timeout: 272 seconds]
03:14:21BornOn420 (BornOn420) joins
03:21:07BornOn420 quits [Ping timeout: 272 seconds]
03:22:46BornOn420 (BornOn420) joins
03:33:09BornOn420 quits [Ping timeout: 272 seconds]
03:38:33BornOn420 (BornOn420) joins
03:52:09BornOn420 quits [Ping timeout: 272 seconds]
04:03:39BornOn420 (BornOn420) joins
04:10:31BornOn420 quits [Ping timeout: 272 seconds]
04:16:58BornOn420 (BornOn420) joins
04:23:11BornOn420 quits [Ping timeout: 272 seconds]
04:24:04BornOn420 (BornOn420) joins
04:27:09dumbgoy quits [Ping timeout: 265 seconds]
04:30:47BornOn420 quits [Ping timeout: 272 seconds]
04:41:32BornOn420 (BornOn420) joins
04:52:57BornOn420 quits [Ping timeout: 272 seconds]
04:54:41BornOn420 (BornOn420) joins
05:03:06treora quits [Remote host closed the connection]
05:07:15treora joins
05:12:04BlueMaxima_ quits [Client Quit]
05:47:23nicolas17 quits [Quit: Konversation terminated!]
05:57:42etnguyen03 quits [Client Quit]
06:07:24Island__ quits [Read error: Connection reset by peer]
06:27:20Arcorann (Arcorann) joins
06:34:15DogsRNice quits [Read error: Connection reset by peer]
06:58:56jtagcat quits [Client Quit]
06:59:09jtagcat (jtagcat) joins
07:14:31Dango360 quits [Read error: Connection reset by peer]
07:15:28eroc1990 (eroc1990) joins
07:17:21eroc19906 quits [Ping timeout: 272 seconds]
07:52:36<masterX244>Brickset forums living on borrowed time... should be down already but it isn't, and some random posts are still coming in over time. Need to catch them with follow-up pulls
07:53:37<masterX244>JAA: did you catch motor-talk fully? They will stay up, but if a user opts out of the transfer to the new owner they zap that user's data, not sure if posts get zapped, too
08:00:11nfriedly quits [Remote host closed the connection]
08:09:14Arcorann quits [Remote host closed the connection]
08:15:18Arcorann (Arcorann) joins
10:00:03Bleo1 quits [Client Quit]
10:01:16Bleo1 joins
10:13:55sec^nd quits [Remote host closed the connection]
10:14:17sec^nd (second) joins
11:31:13nfriedly joins
12:11:11sec^nd quits [Remote host closed the connection]
12:11:30sec^nd (second) joins
12:12:10benjinsm joins
12:16:17benjinsmi quits [Ping timeout: 272 seconds]
12:29:31Megame (Megame) joins
12:33:09bf_ joins
12:38:54bf_ quits [Remote host closed the connection]
12:50:06benjinsmi joins
12:50:35icedice (icedice) joins
12:53:41benjinsm quits [Ping timeout: 265 seconds]
13:02:31Arcorann quits [Ping timeout: 272 seconds]
13:06:33benjinsm joins
13:08:51AK quits [Ping timeout: 272 seconds]
13:09:30benjins joins
13:10:45benjinsmi quits [Ping timeout: 272 seconds]
13:12:32benjinsm quits [Ping timeout: 265 seconds]
13:16:11AK (AK) joins
13:26:37<AK>Anyone else hosting in Hetzner get hit by that switch going down?
13:27:23<AK>`Switch fault hel1-dc3-sw_21`
13:27:58<murb>presumably only if you're hosting in hel?
13:29:58<AK>Yep only if you were in dc3 too I'd assume
13:37:07<murb>apparently i'm dc1
14:17:51etnguyen03 (etnguyen03) joins
14:23:01nicolas17 joins
14:56:25bf_ joins
15:00:47bf_ quits [Remote host closed the connection]
15:01:25benjinsm joins
15:01:28qwertyasdfuiopghjkl quits [Remote host closed the connection]
15:04:45HP_Archivist quits [Ping timeout: 272 seconds]
15:05:09benjins quits [Ping timeout: 265 seconds]
15:06:38bf_ joins
15:07:52HP_Archivist (HP_Archivist) joins
15:10:38imer quits [Killed (NickServ (GHOST command used by imer0))]
15:10:45imer (imer) joins
15:13:43bf_ quits [Remote host closed the connection]
15:14:35bf_ joins
15:24:23HP_Archivist quits [Ping timeout: 272 seconds]
15:26:17Wohlstand quits [Client Quit]
15:32:07bf_ quits [Remote host closed the connection]
15:32:38bf_ joins
15:33:34bf_ quits [Remote host closed the connection]
15:33:50bf_ joins
15:34:46bf_ quits [Remote host closed the connection]
15:35:27bf_ joins
15:48:33<@JAA>masterX244: I did not, wasn't aware content was still in danger. I see that they now have a JS challenge thing. Funnily enough, there's a noscript meta redirect fallback that ... just bypasses it?
15:54:46qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
16:06:23Island joins
16:14:33balrog_ is now known as balrog
16:23:33bf_ quits [Remote host closed the connection]
16:44:55Larsenv quits [Quit: The Lounge - https://thelounge.chat]
16:53:03icedice quits [Ping timeout: 272 seconds]
16:54:28DigitalDragons quits [Client Quit]
17:04:53nostalgebraist joins
17:26:10Wohlstand (Wohlstand) joins
17:34:31Dango360 (Dango360) joins
17:40:29HP_Archivist (HP_Archivist) joins
17:48:47etnguyen03 quits [Ping timeout: 272 seconds]
18:02:23etnguyen03 (etnguyen03) joins
18:15:06etnguyen03 quits [Ping timeout: 265 seconds]
18:23:14dumbgoy joins
18:29:27etnguyen03 (etnguyen03) joins
18:45:52KrazeeTobi joins
18:46:25Wohlstand quits [Ping timeout: 272 seconds]
18:51:56<KrazeeTobi>Currently archiving a Japanese website with video game information (news in particular) that goes back to 1998; I'm just encountering a problem with grabbing each article (can't seem to find any scraper that can actually grab all of the links in the HTML, despite them being visible in the plaintext source). I've tried using HTTrack, wget, and such to no avail.
18:51:56<KrazeeTobi>If somebody could give me a hand resolving the problem, thank you in advance.
18:52:43<KrazeeTobi>I've managed to crack the problem for the 1998-2004 URLs but not for 2005-2011: https://nlab.itmedia.co.jp/games/news/0501.html
18:52:57DigitalDragons (DigitalDragons) joins
18:55:36<masterX244>JAA: that shredding happens in December, so still a bit of time left. Brickset forums seem to be down finally, there was an hour with no login possible before the final kill, which allowed me to snag the last few posts
18:56:14<masterX244>gonna prepare a "cargo train" of WARCs
18:56:41<@arkiver>KrazeeTobi: the links are to a different domain
18:56:57<@arkiver>likely whatever you use sticks to the domain it starts with
18:57:53<KrazeeTobi>Odd, that domain redirects to a valid link...
18:58:46<@JAA>KrazeeTobi: You'll need to allow it to follow links to the 'gamez' subdomain, yeah.
18:59:21<pokechu22>With the way archivebot's redirect handling works, it'd recurse properly on those, right?
18:59:35<@JAA>masterX244: Ack. I also ran Brickset through AB, and that finished in time as well I think.
18:59:38<pokechu22>The behavior I've seen is that it recurses if the redirect source *OR* destination is onsite
18:59:44<@JAA>pokechu22: For some value of 'properly'.
19:00:04<@JAA>It wouldn't follow outlinks on the article pages, I think.
19:00:19<@JAA>Not sure about recursing through onsite links on them.
19:00:21<KrazeeTobi>This is not a couple thousand we're talking about, to be fair, so AB may not be a great choice
19:00:23<@JAA>Probably not that either.
19:00:40<@JAA>KrazeeTobi: AB frequently does millions of URLs these days.
19:01:01<@JAA>The 'a couple thousand' statement is a decade old. :-)
19:01:11<KrazeeTobi>Ah. The wiki's info on the bot gave me the impression that it's usually good for smaller- wait is it?
19:01:32<@JAA>The largest jobs we've run are in the hundreds of millions of URLs.
19:01:52<KrazeeTobi>Well... Okay then, guess I was a bit misled there lol
19:02:02<@JAA>Ah yeah, the wiki page still says 'a few hundred thousand'.
19:02:14<masterX244>JAA: main page or the forums?
19:02:28<@JAA>And I guess that's not too bad as a rule of thumb.
19:02:41<masterX244>main page staying safe, it was the forums that got more and more expensive and less and less traffic thanks to damned social media
19:02:44<@JAA>masterX244: The forums, recursion from https://forum.brickset.com/ with many aggressive ignores.
19:03:11<KrazeeTobi>Better to be safe than sorry lol, even if I've got a data hoard onset going on hahahahah
19:03:45<masterX244>not sure if it got latest stuff, what did you ignore off exactly?
19:04:20<@JAA>Yeah, it almost certainly didn't get everything posted after the job start.
19:04:22<masterX244>sidenote: my older crawls got imgur-processed already since i kept them locally stored, too even though i sent them off to the wayback
19:04:42<@JAA>http://archivebot.com/ignores/5vn0jh3mknwgc1j27ungb7dnt?compact=true
19:04:58<masterX244>ran an ugly "hackery" with a limited crawl off from recent to get a "sync" and then i manually poked for new posts
19:05:59<masterX244>(there were 2 login-only sections, crawled them in a separate file that won't get archive.org'd, kept those fully isolated), used my logged in "read markers" to scan quickly if i needed to requeue a thread, too for that manual final sync
19:06:42<KrazeeTobi>Yup, telling HTTrack to grab from the same domain seems to have got it to work -- cheers for the assist everyone
19:08:55<masterX244>JAA: used this dirty ignore later on to trim off any offsite media ^https://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)
19:09:40<masterX244>(most users upload straight to the forums and most of the offsite links were already run through the pipelines, especially imgur before the shredders ran hot)
19:10:18<pokechu22>Why not ignore gravatar too?
19:10:32<pokechu22>It seems unlikely that gravatar will be deleted afterwards - it could be done in a separate job
19:10:51<@JAA>Lots of the Gravatar URLs redirected back to the forums for the default avatar thingy.
19:11:02<masterX244>most users were regulars, that amount was minuscule
19:11:25<masterX244>had a fuckup and had v-cdn.com initially and not v-cdn.net, quick sqlite dump and generating a list fixed it
19:11:38<masterX244>(aka fixup crawl for that data)
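The ignore pattern quoted at 19:08:55 relies on a negative lookahead: it matches (and so ignores) any https URL whose host does not begin with one of the three listed names. Below is a minimal Python sketch of that behaviour, assuming the pattern is applied as a plain regex search against the full URL. The `is_ignored` helper and the sample URLs are hypothetical illustrations, not taken from grab-site or ArchiveBot.

```python
import re

# Pattern copied from the 19:08:55 message. The (?!...) group is a negative
# lookahead: the match succeeds only when none of the alternatives follow
# "https://", i.e. when the URL is NOT on one of the three kept hosts.
IGNORE = re.compile(
    r"^https://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)"
)


def is_ignored(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the ignore pattern."""
    return IGNORE.search(url) is not None


if __name__ == "__main__":
    # Sample URLs are made up for illustration.
    for url in (
        "https://forum.brickset.com/discussion/1",  # kept
        "https://us.v-cdn.net/some/image.png",      # kept
        "https://us.v-cdn.com/some/image.png",      # ignored (.com, not .net)
        "https://i.imgur.com/example.jpg",          # ignored (offsite)
        "http://example.com/",                      # not matched: plain http
    ):
        print(url, "->", "ignored" if is_ignored(url) else "kept")
```

Two things fall out of the sketch: the pattern only anchors on `https://`, so plain `http://` URLs are never ignored by it, and writing `v-cdn.com` instead of `v-cdn.net` (the mix-up mentioned at 19:11:25) would cause the real CDN URLs to be ignored.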
19:11:46<@JAA>Uh, the forums still seem to be up?
19:12:10<masterX244>had maintenance status a little bit ago and that seemed the final coffin nail
19:12:18<@JAA>Ah, probably just caching of the most visited pages, yeah.
19:12:26<@JAA>Getting the 503 after following a couple links.
19:12:58<masterX244>thank god that the latest threads were cache-sticky so i was able to yoink them off
19:13:06<@JAA>:-)
19:13:38<masterX244>sidenote: suckled with conc 20 at my end to emergency-yoink everything, wasn't sure when the shredders were going to start
19:13:57<masterX244>(aka an IDGAF crawl in AT style)
19:14:25<masterX244>(i can risk burning one or 2 IPs since i got 2 dedis and 2 vservers at hetzner)
19:14:55KrazeeTobi quits [Remote host closed the connection]
19:15:04<@JAA>Hmm, I wonder whether I should also rerun it through AB just to get the most recent few pages of content.
19:15:46<@JAA>Can't hurt, and it'll be tiny.
19:18:26<masterX244>uploading my yoink already
19:18:45evan quits [Remote host closed the connection]
19:18:45shreyasminocha quits [Remote host closed the connection]
19:18:45thehedgeh0g quits [Read error: Connection reset by peer]
19:18:49evan joins
19:18:53shreyasminocha (shreyasminocha) joins
19:18:53thehedgeh0g (mrHedgehog0) joins
19:19:09<masterX244>(keeping the older 2019 mirror, too since that might have deleted content that got lost afterwards, you never know the stupid imagehosters)
19:19:26nostalgebraist quits [Client Quit]
19:20:12<masterX244>one thing that i like about grabsite is that i can reuse an ignoreset from a crawl for future ones.
19:20:50HP_Archivist quits [Ping timeout: 265 seconds]
19:29:31<@JAA>Oh yeah, those Gravatar URLs actually go to vanillicon.com, not back to the forums. Must've mixed that up with something else.
19:30:03<@JAA>Outlinks from the AB crawl will also go into #// after some minor filtering.
19:42:32<apache2>is there a list of domains crawled by IA somewhere? is there a way to submit missing domains in bulk?
19:43:51<@JAA>I don't believe they make anything like that public.
19:45:24DogsRNice joins
19:45:44<apache2>JAA: do you have an idea about where/whom I should submit the missing ones? (I take it they won't appreciate the "archive now" being spammed to death)
19:47:24<@JAA>Not a clue. I'd probably ask info@archive.org for advice.
19:48:15<apache2>JAA: thanks, will do
20:03:29icedice (icedice) joins
20:10:03Wohlstand (Wohlstand) joins
20:25:07dumbgoy quits [Ping timeout: 265 seconds]
20:28:27<h2ibot>Systwi edited IRC (+357, Added some IRC server URLs, sorted the list…): https://wiki.archiveteam.org/?diff=51105&oldid=50458
20:50:25shinji257_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
21:03:33BlueMaxima joins
21:13:45pupnik joins
21:14:03<pupnik>thank you for preserving the maemo links o/
21:17:56shinji257 (shinji257) joins
21:19:14shinji257 quits [Client Quit]
21:25:51shinji257 (shinji257) joins
21:29:14dumbgoy joins
21:31:04bf_ joins
21:33:08bf_ quits [Remote host closed the connection]
21:37:21null quits [Quit: ZNC - https://znc.in]
21:41:10Megame quits [Client Quit]
21:45:13bf_ joins
21:57:29etnguyen03 quits [Remote host closed the connection]
21:58:55etnguyen03 (etnguyen03) joins
22:00:25ThetaDev_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
22:00:33ThetaDev joins
22:12:02Wohlstand quits [Client Quit]
22:35:53rktk (rktk) joins
22:53:59etnguyen03 quits [Ping timeout: 265 seconds]
23:13:23etnguyen03 (etnguyen03) joins
23:16:49shinji257 quits [Client Quit]
23:22:20ThreeHM_ (ThreeHeadedMonkey) joins
23:23:11ThreeHM quits [Ping timeout: 272 seconds]
23:23:24ThreeHM_ is now known as ThreeHM
23:41:10Arcorann (Arcorann) joins
23:46:54Mateon1 joins
23:52:32NIC007a83 joins
23:52:35NIC007a83 quits [Remote host closed the connection]
23:55:23shinji257 (shinji257) joins