00:21:15 | | BornOn420 quits [Ping timeout: 272 seconds] |
00:22:42 | | BornOn420 (BornOn420) joins |
00:33:17 | | BornOn420 quits [Ping timeout: 272 seconds] |
00:38:12 | | BornOn420 (BornOn420) joins |
00:52:17 | | BornOn420 quits [Ping timeout: 272 seconds] |
00:54:37 | | benjinsmi joins |
00:54:43 | | DogsRNice quits [Remote host closed the connection] |
00:54:43 | | benjinsm quits [Remote host closed the connection] |
00:54:43 | | mindstrut1 quits [Remote host closed the connection] |
00:54:43 | | abirkill quits [Client Quit] |
00:54:46 | | Island quits [Remote host closed the connection] |
00:54:48 | | DogsRNice joins |
00:54:56 | | Island joins |
00:55:03 | | abirkill (abirkill) joins |
00:55:39 | | mindstrut joins |
00:57:36 | | Island_ joins |
00:57:37 | | TheTechRobo quits [Client Quit] |
00:57:38 | | Island quits [Remote host closed the connection] |
00:58:05 | | TheTechRobo8 (TheTechRobo) joins |
01:00:55 | | Island__ joins |
01:01:14 | | TheTechRobo8 quits [Excess Flood] |
01:01:14 | | Island_ quits [Remote host closed the connection] |
01:01:14 | | DogsRNice quits [Remote host closed the connection] |
01:01:14 | | DogsRNice joins |
01:01:50 | | TheTechRobo (TheTechRobo) joins |
01:03:07 | | BornOn420 (BornOn420) joins |
01:04:33 | | TheTechRobo quits [Excess Flood] |
01:05:12 | | TheTechRobo (TheTechRobo) joins |
01:21:25 | | BornOn420 quits [Ping timeout: 272 seconds] |
01:31:46 | | etnguyen03 (etnguyen03) joins |
01:33:01 | | BornOn420 (BornOn420) joins |
01:39:09 | | BornOn420 quits [Ping timeout: 272 seconds] |
01:43:54 | | BornOn420 (BornOn420) joins |
01:48:47 | | icedice quits [Client Quit] |
01:50:33 | | BornOn420 quits [Ping timeout: 272 seconds] |
01:53:43 | | tzt quits [Ping timeout: 272 seconds] |
01:56:25 | | tzt (tzt) joins |
01:59:25 | | etnguyen03 quits [Ping timeout: 272 seconds] |
02:01:28 | | BornOn420 (BornOn420) joins |
02:12:43 | | BornOn420 quits [Ping timeout: 272 seconds] |
02:14:56 | | BornOn420 (BornOn420) joins |
02:21:55 | | Wohlstand (Wohlstand) joins |
02:26:01 | | BornOn420 quits [Ping timeout: 272 seconds] |
02:37:10 | | BornOn420 (BornOn420) joins |
02:37:33 | | etnguyen03 (etnguyen03) joins |
02:51:59 | | BornOn420 quits [Ping timeout: 272 seconds] |
02:55:46 | | BornOn420 (BornOn420) joins |
03:02:45 | | BornOn420 quits [Ping timeout: 272 seconds] |
03:14:21 | | BornOn420 (BornOn420) joins |
03:21:07 | | BornOn420 quits [Ping timeout: 272 seconds] |
03:22:46 | | BornOn420 (BornOn420) joins |
03:33:09 | | BornOn420 quits [Ping timeout: 272 seconds] |
03:38:33 | | BornOn420 (BornOn420) joins |
03:52:09 | | BornOn420 quits [Ping timeout: 272 seconds] |
04:03:39 | | BornOn420 (BornOn420) joins |
04:10:31 | | BornOn420 quits [Ping timeout: 272 seconds] |
04:16:58 | | BornOn420 (BornOn420) joins |
04:23:11 | | BornOn420 quits [Ping timeout: 272 seconds] |
04:24:04 | | BornOn420 (BornOn420) joins |
04:27:09 | | dumbgoy quits [Ping timeout: 265 seconds] |
04:30:47 | | BornOn420 quits [Ping timeout: 272 seconds] |
04:41:32 | | BornOn420 (BornOn420) joins |
04:52:57 | | BornOn420 quits [Ping timeout: 272 seconds] |
04:54:41 | | BornOn420 (BornOn420) joins |
05:03:06 | | treora quits [Remote host closed the connection] |
05:07:15 | | treora joins |
05:12:04 | | BlueMaxima_ quits [Client Quit] |
05:47:23 | | nicolas17 quits [Quit: Konversation terminated!] |
05:57:42 | | etnguyen03 quits [Client Quit] |
06:07:24 | | Island__ quits [Read error: Connection reset by peer] |
06:27:20 | | Arcorann (Arcorann) joins |
06:34:15 | | DogsRNice quits [Read error: Connection reset by peer] |
06:58:56 | | jtagcat quits [Client Quit] |
06:59:09 | | jtagcat (jtagcat) joins |
07:14:31 | | Dango360 quits [Read error: Connection reset by peer] |
07:15:28 | | eroc1990 (eroc1990) joins |
07:17:21 | | eroc19906 quits [Ping timeout: 272 seconds] |
07:52:36 | <masterX244> | Brickset forums living on borrowed time... should be down already but it isn't, and some random posts are still coming in over time. Need to catch them with followup pulls |
07:53:37 | <masterX244> | JAA: did you catch motor-talk fully? They will stay up but if a user opts out on the transfer to new owner they zap its data, not sure if posts get zapped, too |
08:00:11 | | nfriedly quits [Remote host closed the connection] |
08:09:14 | | Arcorann quits [Remote host closed the connection] |
08:15:18 | | Arcorann (Arcorann) joins |
10:00:03 | | Bleo1 quits [Client Quit] |
10:01:16 | | Bleo1 joins |
10:13:55 | | sec^nd quits [Remote host closed the connection] |
10:14:17 | | sec^nd (second) joins |
11:31:13 | | nfriedly joins |
12:11:11 | | sec^nd quits [Remote host closed the connection] |
12:11:30 | | sec^nd (second) joins |
12:12:10 | | benjinsm joins |
12:16:17 | | benjinsmi quits [Ping timeout: 272 seconds] |
12:29:31 | | Megame (Megame) joins |
12:33:09 | | bf_ joins |
12:38:54 | | bf_ quits [Remote host closed the connection] |
12:50:06 | | benjinsmi joins |
12:50:35 | | icedice (icedice) joins |
12:53:41 | | benjinsm quits [Ping timeout: 265 seconds] |
13:02:31 | | Arcorann quits [Ping timeout: 272 seconds] |
13:06:33 | | benjinsm joins |
13:08:51 | | AK quits [Ping timeout: 272 seconds] |
13:09:30 | | benjins joins |
13:10:45 | | benjinsmi quits [Ping timeout: 272 seconds] |
13:12:32 | | benjinsm quits [Ping timeout: 265 seconds] |
13:16:11 | | AK (AK) joins |
13:26:37 | <AK> | Anyone else hosting in Hetzner get hit by that switch going down? |
13:27:23 | <AK> | `Switch fault hel1-dc3-sw_21` |
13:27:58 | <murb> | presumably only if you're hosting in hel? |
13:29:58 | <AK> | Yep only if you were in dc3 too I'd assume |
13:37:07 | <murb> | apparently i'm dc1 |
14:17:51 | | etnguyen03 (etnguyen03) joins |
14:23:01 | | nicolas17 joins |
14:56:25 | | bf_ joins |
15:00:47 | | bf_ quits [Remote host closed the connection] |
15:01:25 | | benjinsm joins |
15:01:28 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
15:04:45 | | HP_Archivist quits [Ping timeout: 272 seconds] |
15:05:09 | | benjins quits [Ping timeout: 265 seconds] |
15:06:38 | | bf_ joins |
15:07:52 | | HP_Archivist (HP_Archivist) joins |
15:10:38 | | imer quits [Killed (NickServ (GHOST command used by imer0))] |
15:10:45 | | imer (imer) joins |
15:13:43 | | bf_ quits [Remote host closed the connection] |
15:14:35 | | bf_ joins |
15:24:23 | | HP_Archivist quits [Ping timeout: 272 seconds] |
15:26:17 | | Wohlstand quits [Client Quit] |
15:32:07 | | bf_ quits [Remote host closed the connection] |
15:32:38 | | bf_ joins |
15:33:34 | | bf_ quits [Remote host closed the connection] |
15:33:50 | | bf_ joins |
15:34:46 | | bf_ quits [Remote host closed the connection] |
15:35:27 | | bf_ joins |
15:48:33 | <@JAA> | masterX244: I did not, wasn't aware content was still in danger. I see that they now have a JS challenge thing. Funnily enough, there's a noscript meta redirect fallback that ... just bypasses it? |
15:54:46 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
16:06:23 | | Island joins |
16:14:33 | | balrog_ is now known as balrog |
16:23:33 | | bf_ quits [Remote host closed the connection] |
16:44:55 | | Larsenv quits [Quit: The Lounge - https://thelounge.chat] |
16:53:03 | | icedice quits [Ping timeout: 272 seconds] |
16:54:28 | | DigitalDragons quits [Client Quit] |
17:04:53 | | nostalgebraist joins |
17:26:10 | | Wohlstand (Wohlstand) joins |
17:34:31 | | Dango360 (Dango360) joins |
17:40:29 | | HP_Archivist (HP_Archivist) joins |
17:48:47 | | etnguyen03 quits [Ping timeout: 272 seconds] |
18:02:23 | | etnguyen03 (etnguyen03) joins |
18:15:06 | | etnguyen03 quits [Ping timeout: 265 seconds] |
18:23:14 | | dumbgoy joins |
18:29:27 | | etnguyen03 (etnguyen03) joins |
18:45:52 | | KrazeeTobi joins |
18:46:25 | | Wohlstand quits [Ping timeout: 272 seconds] |
18:51:56 | <KrazeeTobi> | Currently archiving a Japanese website with video game information (news in particular) that goes back to 1998; I'm just encountering a problem with grabbing each article (can't seem to find any scraper that can actually grab all of the links in the html, despite them being in plaintext view). I've tried using HTTrack, wget, and such to no avail. |
18:51:56 | <KrazeeTobi> | If somebody could give me a hand into resolving the problem, thank you in advance. |
18:52:43 | <KrazeeTobi> | I've managed to crack the problem for the 1998-2004 URLs but not for 2005-2011: https://nlab.itmedia.co.jp/games/news/0501.html |
18:52:57 | | DigitalDragons (DigitalDragons) joins |
18:55:36 | <masterX244> | JAA: that shredding happens in december, so still a bit of time left. bricksetforums seem to be down finally, there was an hour without login possible before the final kill which allowed me to snag the last few posts |
18:56:14 | <masterX244> | gonna prepare a "cargo train" of WARCs |
18:56:41 | <@arkiver> | KrazeeTobi: the links are to a different domain |
18:56:57 | <@arkiver> | likely whatever you use sticks to the domain it starts with |
18:57:53 | <KrazeeTobi> | Odd, that domain redirects to a valid link... |
18:58:46 | <@JAA> | KrazeeTobi: You'll need to allow it to follow links to the 'gamez' subdomain, yeah. |
18:59:21 | <pokechu22> | With the way archivebot's redirect handling works, it'd recurse properly on those, right? |
18:59:35 | <@JAA> | masterX244: Ack. I also ran Brickset through AB, and that finished in time as well I think. |
18:59:38 | <pokechu22> | The behavior I've seen is that it recurses if the redirect source *OR* destination is onsite |
18:59:44 | <@JAA> | pokechu22: For some value of 'properly'. |
19:00:04 | <@JAA> | It wouldn't follow outlinks on the article pages, I think. |
19:00:19 | <@JAA> | Not sure about recursing through onsite links on them. |
19:00:21 | <KrazeeTobi> | This is not a couple-thousand we're talking to be fair, so AB may not be a great choice |
19:00:23 | <@JAA> | Probably not that either. |
19:00:40 | <@JAA> | KrazeeTobi: AB frequently does millions of URLs these days. |
19:01:01 | <@JAA> | The 'a couple thousand' statement is a decade old. :-) |
19:01:11 | <KrazeeTobi> | Ah. The wiki's info on the bot gave me the impression that it's usually good for smaller- wait is it? |
19:01:32 | <@JAA> | The largest jobs we've run are in the hundreds of millions of URLs. |
19:01:52 | <KrazeeTobi> | Well... Okay then, guess I was a bit misled there lol |
19:02:02 | <@JAA> | Ah yeah, the wiki page still says 'a few hundred thousand'. |
19:02:14 | <masterX244> | JAA: main page or the forums? |
19:02:28 | <@JAA> | And I guess that's not too bad as a rule of thumb. |
19:02:41 | <masterX244> | main page staying safe, it was the forums that got more and more expensive and less and less traffic thanks to damned social media |
19:02:44 | <@JAA> | masterX244: The forums, recursion from https://forum.brickset.com/ with many aggressive ignores. |
19:03:11 | <KrazeeTobi> | Better to be safe than sorry lol, even if I've got a data hoard onset going on hahahahah |
19:03:45 | <masterX244> | not sure if it got latest stuff, what did you ignore off exactly? |
19:04:20 | <@JAA> | Yeah, it almost certainly didn't get everything posted after the job start. |
19:04:22 | <masterX244> | sidenote: my older crawls got imgur-processed already since i kept them locally stored, too even though i sent them off to the wayback |
19:04:42 | <@JAA> | http://archivebot.com/ignores/5vn0jh3mknwgc1j27ungb7dnt?compact=true |
19:04:58 | <masterX244> | ran an ugly "hackery" with a limited crawl off from recent to get a "sync" and then i manually poked for new posts |
19:05:59 | <masterX244> | (there were 2 loginonly sections, crawled them in a separate file that won't get archive.org'd, kept those fully isolated), used my logged in "read markers" to scan quickly if i needed to requeue a thread, too for that manual final sync |
19:06:42 | <KrazeeTobi> | Yup, telling HTTrack to grab from the same domain seems to have got it to work -- cheers for the assist everyone |
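
The fix discussed above (a crawler that only follows links on its starting host will skip article links that live on another subdomain) can be sketched roughly as follows. This is a minimal illustration, not how HTTrack or wget implement scoping; the hostnames `www.news.example` and `gamez.news.example` and the helper names `LinkCollector` and `in_scope_links` are hypothetical stand-ins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def in_scope_links(html, base_url, allowed_hosts):
    """Return the links in `html` whose hostname is in `allowed_hosts`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if urlsplit(u).hostname in allowed_hosts]


# Hypothetical index page: one same-host link, one link to a sibling
# subdomain, one link to an unrelated site.
PAGE = (
    '<a href="/a.html">a</a>'
    '<a href="http://gamez.news.example/b.html">b</a>'
    '<a href="http://other.example/c.html">c</a>'
)
BASE = "http://www.news.example/index.html"

# Scope limited to the starting host: the article on the subdomain is dropped.
same_host_only = in_scope_links(PAGE, BASE, {"www.news.example"})

# Scope widened to include the subdomain: both links are kept.
with_subdomain = in_scope_links(
    PAGE, BASE, {"www.news.example", "gamez.news.example"}
)
```

With only the starting host allowed, the subdomain link never enters the queue, which matches the symptom of links being visible in the HTML but never fetched.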
19:08:55 | <masterX244> | JAA: used this dirty ignore later on to trim off any offsite media ^https://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com) |
19:09:40 | <masterX244> | (most users upload straight to the forums and most of the offsite links were already ran thru the pipelines, especially imgur before the shredders ran hot) |
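
For readers unfamiliar with negative lookaheads, here is a small sketch of how the ignore pattern quoted above behaves. The pattern is copied verbatim from the log; the sample URLs and the helper `is_ignored` are made up for illustration, and it is an assumption here that the crawler applies ignore patterns with regex search semantics.

```python
import re

# Ignore pattern copied verbatim from the message above.
IGNORE = re.compile(
    r"^https://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)"
)


def is_ignored(url):
    """True if the pattern matches, i.e. the URL would be skipped."""
    return IGNORE.search(url) is not None


# Hypothetical sample URLs.
samples = [
    "https://forum.brickset.com/discussion/1",  # on the allowlist, kept
    "https://us.v-cdn.net/some/upload.jpg",     # on the allowlist, kept
    "https://secure.gravatar.com/avatar/abc",   # on the allowlist, kept
    "https://i.imgur.com/abc.png",              # offsite https, ignored
    "http://i.imgur.com/abc.png",               # plain http, pattern does not apply
]

results = {url: is_ignored(url) for url in samples}
```

Two properties are worth noting: the pattern is anchored to `https://`, so plain `http://` URLs are not affected by it, and the lookahead has no trailing delimiter, so a hostname that merely starts with one of the listed names would also be kept.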
19:10:18 | <pokechu22> | Why not ignore gravatar too? |
19:10:32 | <pokechu22> | It seems unlikely that gravatar will be deleted afterwards - it could be done in a separate job |
19:10:51 | <@JAA> | Lots of the Gravatar URLs redirected back to the forums for the default avatar thingy. |
19:11:02 | <masterX244> | most users were regulars, that amount was minuscule |
19:11:25 | <masterX244> | had a fuckup and had v-cdn.com initially and not v-cdn.net, quick sqlite dump and generating a list fixed it |
19:11:38 | <masterX244> | (aka fixup crawl for that data) |
19:11:46 | <@JAA> | Uh, the forums still seem to be up? |
19:12:10 | <masterX244> | had maintenance status a little bit ago and that seemed the final coffin nail |
19:12:18 | <@JAA> | Ah, probably just caching of the most visited pages, yeah. |
19:12:26 | <@JAA> | Getting the 503 after following a couple links. |
19:12:58 | <masterX244> | thanks god that the latest threads were cache-sticky so i was able to yoink them off |
19:13:06 | <@JAA> | :-) |
19:13:38 | <masterX244> | sidenote: suckled with conc 20 at my end to emergency-yoink everything, wasn't sure when the shredders were going to start |
19:13:57 | <masterX244> | (aka a IDGAF crawl in AT style) |
19:14:25 | <masterX244> | (i can risk burning one or 2 IPs since i got 2 dedis and 2 vservers at hetzner) |
19:14:55 | | KrazeeTobi quits [Remote host closed the connection] |
19:15:04 | <@JAA> | Hmm, I wonder whether I should also rerun it through AB just to get the most recent few pages of content. |
19:15:46 | <@JAA> | Can't hurt, and it'll be tiny. |
19:18:26 | <masterX244> | uploading my yoink already |
19:18:45 | | evan quits [Remote host closed the connection] |
19:18:45 | | shreyasminocha quits [Remote host closed the connection] |
19:18:45 | | thehedgeh0g quits [Read error: Connection reset by peer] |
19:18:49 | | evan joins |
19:18:53 | | shreyasminocha (shreyasminocha) joins |
19:18:53 | | thehedgeh0g (mrHedgehog0) joins |
19:19:09 | <masterX244> | (keeping the older 2019 mirror, too since that might have deleted content that got lost afterwards, you never know the stupid imagehosters) |
19:19:26 | | nostalgebraist quits [Client Quit] |
19:20:12 | <masterX244> | one thing that i like on grabsite is that i can reuse a ignoreset from a crawl for future ones. |
19:20:50 | | HP_Archivist quits [Ping timeout: 265 seconds] |
19:29:31 | <@JAA> | Oh yeah, those Gravatar URLs actually go to vanillicon.com, not back to the forums. Must've mixed that up with something else. |
19:30:03 | <@JAA> | Outlinks from the AB crawl will also go into #// after some minor filtering. |
19:42:32 | <apache2> | is there a list of domains crawled by IA somewhere? is there a way to submit missing domains in bulk? |
19:43:51 | <@JAA> | I don't believe they make anything like that public. |
19:45:24 | | DogsRNice joins |
19:45:44 | <apache2> | JAA: do you have an idea about where/whom I should submit the missing ones? (I take it they won't appreciate the "archive now" being spammed to death) |
19:47:24 | <@JAA> | Not a clue. I'd probably ask info@archive.org for advice. |
19:48:15 | <apache2> | JAA: thanks, will do |
20:03:29 | | icedice (icedice) joins |
20:10:03 | | Wohlstand (Wohlstand) joins |
20:25:07 | | dumbgoy quits [Ping timeout: 265 seconds] |
20:28:27 | <h2ibot> | Systwi edited IRC (+357, Added some IRC server URLs, sorted the list…): https://wiki.archiveteam.org/?diff=51105&oldid=50458 |
20:50:25 | | shinji257_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
21:03:33 | | BlueMaxima joins |
21:13:45 | | pupnik joins |
21:14:03 | <pupnik> | thank you for preserving the maemo links o/ |
21:17:56 | | shinji257 (shinji257) joins |
21:19:14 | | shinji257 quits [Client Quit] |
21:25:51 | | shinji257 (shinji257) joins |
21:29:14 | | dumbgoy joins |
21:31:04 | | bf_ joins |
21:33:08 | | bf_ quits [Remote host closed the connection] |
21:37:21 | | null quits [Quit: ZNC - https://znc.in] |
21:41:10 | | Megame quits [Client Quit] |
21:45:13 | | bf_ joins |
21:57:29 | | etnguyen03 quits [Remote host closed the connection] |
21:58:55 | | etnguyen03 (etnguyen03) joins |
22:00:25 | | ThetaDev_ quits [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.] |
22:00:33 | | ThetaDev joins |
22:12:02 | | Wohlstand quits [Client Quit] |
22:35:53 | | rktk (rktk) joins |
22:53:59 | | etnguyen03 quits [Ping timeout: 265 seconds] |
23:13:23 | | etnguyen03 (etnguyen03) joins |
23:16:49 | | shinji257 quits [Client Quit] |
23:22:20 | | ThreeHM_ (ThreeHeadedMonkey) joins |
23:23:11 | | ThreeHM quits [Ping timeout: 272 seconds] |
23:23:24 | | ThreeHM_ is now known as ThreeHM |
23:41:10 | | Arcorann (Arcorann) joins |
23:46:54 | | Mateon1 joins |
23:52:32 | | NIC007a83 joins |
23:52:35 | | NIC007a83 quits [Remote host closed the connection] |
23:55:23 | | shinji257 (shinji257) joins |