00:05:26HackMii_ quits [Remote host closed the connection]
00:11:45HackMii_ (hacktheplanet) joins
00:45:21etnguyen03 (etnguyen03) joins
00:55:49sec^nd quits [Ping timeout: 258 seconds]
00:58:10Ruthalas quits [Quit: END OF LINE]
00:58:30Ruthalas (Ruthalas) joins
01:00:02dm4v quits [Client Quit]
01:01:02sec^nd (second) joins
01:01:36dm4v joins
01:01:38dm4v quits [Changing host]
01:01:38dm4v (dm4v) joins
01:02:37K4 joins
01:05:49K4 quits [Client Quit]
01:19:04VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
01:19:43VerifiedJ (VerifiedJ) joins
01:28:26pabs quits [Remote host closed the connection]
01:31:54pabs (pabs) joins
01:57:32Webuser384 quits [Ping timeout: 244 seconds]
02:02:54dm4v quits [Ping timeout: 258 seconds]
02:05:16fangfufu joins
02:07:23dm4v joins
02:07:25dm4v quits [Changing host]
02:07:25dm4v (dm4v) joins
02:13:15katocala quits [Ping timeout: 258 seconds]
02:13:39katocala joins
02:33:51monoxane4 quits [Client Quit]
02:34:05monoxane4 (monoxane) joins
02:50:49Myself quits [Ping timeout: 258 seconds]
02:55:02etnguyen03 quits [Ping timeout: 258 seconds]
03:00:57etnguyen03 (etnguyen03) joins
03:19:34katocala quits [Ping timeout: 258 seconds]
03:20:37katocala joins
03:24:49anarcat (anarcat) joins
03:27:49jamesp quits [Client Quit]
04:00:25etnguyen03 quits [Client Quit]
04:00:35etnguyen03 (etnguyen03) joins
04:37:36qw3rty__ joins
04:41:13qw3rty_ quits [Ping timeout: 258 seconds]
05:01:18march_happy (march_happy) joins
05:02:34<march_happy>Long time no see... First thing: I successfully archived the installer for the UWP version of Autodesk Sketchbook!
05:03:09<march_happy>https://github.com/Aster-the-Med-Stu/Autodesk-Sketchbook-UWP-Backup
05:03:20<march_happy>Together with brushes
05:04:01<march_happy>Later I will upload all the content to archive.org, once I find a proper way to save Autodesk Sketchbook's blog.
05:07:12<march_happy>And I have a question about web archiving. Back when I was participating in the Google+ archive... how did you guys handle the collected data so that it would fit the Wayback Machine?
05:07:51<march_happy>Asking this because I am trying to archive Baidu Tieba
05:08:54<march_happy>The code is finished and saves crawls to a database, but I have no idea how I should properly save that content on archive.org
05:09:55march_happy quits [Remote host closed the connection]
05:12:16etnguyen03 quits [Ping timeout: 258 seconds]
05:23:15etnguyen03 (etnguyen03) joins
06:08:05etnguyen03 quits [Client Quit]
06:34:26HackMii_ quits [Remote host closed the connection]
06:35:17HackMii_ (hacktheplanet) joins
07:09:57IDK_ quits [Client Quit]
07:10:18IDK_ joins
07:15:48betamax quits [Ping timeout: 244 seconds]
07:16:47betamax (betamax) joins
07:22:37tzt quits [Ping timeout: 265 seconds]
07:29:32Lord_Nightmare quits [Quit: ZNC - http://znc.in]
07:32:20Lord_Nightmare (Lord_Nightmare) joins
08:00:14qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
08:02:13HackMii_ quits [Remote host closed the connection]
08:02:50HackMii_ (hacktheplanet) joins
08:06:20qwertyasdfuiopghjkl joins
08:14:55<AK>TheTechRobo, the isitnormal.com AB job is finished :)
09:23:40HackMii_ quits [Remote host closed the connection]
09:24:32HackMii_ (hacktheplanet) joins
09:24:58qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
09:32:02fangfufu quits [Remote host closed the connection]
10:37:43datechnoman quits [Ping timeout: 258 seconds]
10:59:29fangfufu (fangfufu) joins
11:00:26datechnoman (datechnoman) joins
11:26:56HackMii_ quits [Remote host closed the connection]
11:27:53HackMii_ (hacktheplanet) joins
11:34:11TastyWiener95 (TastyWiener95) joins
11:59:24G4te_Keep3r quits [Remote host closed the connection]
12:00:06G4te_Keep3r joins
13:29:04Arcorann quits [Ping timeout: 258 seconds]
13:37:07Mateon1 quits [Ping timeout: 258 seconds]
13:44:06Myself (myself) joins
13:49:54Mateon1 joins
13:55:54Mateon1 quits [Ping timeout: 258 seconds]
14:00:45vukky (Vukky) joins
14:07:51Mateon1 joins
14:21:58qwertyasdfuiopghjkl joins
14:38:05march_happy (march_happy) joins
14:38:28march_happy quits [Client Quit]
14:38:42march_happy (march_happy) joins
14:43:38<march_happy>I guess it's time to ask again in case you were sleeping. How should I export crawled HTML saved in a database to something that is acceptable for the Wayback Machine?
14:44:14<march_happy>I have no experience with WARC, and so far I have only crawled raw HTML.
14:45:24<@rewby>While I personally do not have the knowledge to help you, I'd advise you to stick around longer. Last time you left almost immediately after you asked your question and didn't give anyone the time to answer.
14:46:07<march_happy>Meh, network outage, and I forgot to set a timeout for reconnecting
14:46:21<@rewby>We can't help you if you're not connected.
14:47:13<march_happy>I know there's a log on kiska's server :)
14:47:32<@rewby>Generally people don't answer questions if the asker has left.
14:49:58<anarcat>march_happy: what i know is here https://anarc.at/services/archive/web/
14:50:50<anarcat>i've also used https://github.com/ludios/grab-site/
14:50:56<anarcat>for larger, multi-site jobs
14:52:23<thuban>march_happy: the wayback machine works with warcs, not (metadata-less) html, and generally doesn't accept data from third parties due to concerns about data integrity.
14:52:31<thuban>however, the internet archive has an uploading guide here: https://help.archive.org/hc/en-us/articles/360002360111-Uploading-A-Basic-Guide
14:52:37<@rewby>Also, reading back on your questions, I believe the wayback machine only accepts archives in WARC format and that it doesn't just ingest anyone's warcs, you need to get approved by the IA. There's no good way of just getting html into warcs since you'd have to fake the metadata
14:52:44<@rewby>What thuban said, basically
14:52:44<thuban>and would be very happy to have your data in whatever format you find convenient :)
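
The "fake the metadata" concern above applies to WARC `response` records, which carry the original HTTP exchange. The WARC format also defines a `resource` record type that stores a payload without protocol headers. The following is only a minimal sketch of that container format using the Python standard library; `build_resource_record`, `export_rows`, and the `(url, html, fetched_at)` row shape are hypothetical names chosen for illustration, and nothing here implies that the Wayback Machine would ingest or index such records.

```python
import io
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, html: bytes, fetched_at: datetime) -> bytes:
    """Serialize one WARC/1.0 'resource' record holding a raw HTML payload.

    fetched_at is assumed to be a timezone-aware datetime (hypothetical
    column from the crawler's database); it is normalized to UTC.
    """
    warc_date = fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", warc_date),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/html"),
        # Content-Length counts the bytes of the record block only.
        ("Content-Length", str(len(html))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record ends with two CRLF pairs after the block.
    return head.encode("utf-8") + html + b"\r\n\r\n"


def export_rows(rows, out) -> int:
    """Write one record per (url, html, fetched_at) row to a binary stream."""
    count = 0
    for url, html, fetched_at in rows:
        out.write(build_resource_record(url, html, fetched_at))
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder rows standing in for the database contents.
    demo_rows = [
        ("https://example.com/p/1", b"<html>one</html>",
         datetime(2021, 1, 2, 3, 4, 5, tzinfo=timezone.utc)),
    ]
    buffer = io.BytesIO()
    print(export_rows(demo_rows, buffer), "record(s),", len(buffer.getvalue()), "bytes")
```

Because the records say plainly that no HTTP response was captured, nothing about the original fetch is invented; the trade-off is that the capture carries less provenance than a crawl recorded directly to WARC.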
14:54:11<@rewby>Also, when archiving we use software that natively archives to warc files.
14:54:19<@rewby>grab-site is easy to use, but there are other options too
14:54:38<march_happy>But there's a WAF
14:54:46<march_happy>So that's why a custom crawler is needed
14:55:14<@rewby>You can still emit warc files with a customized crawler (that's how all of our warrior projects work)
14:55:37<@rewby>One way of doing it is with warcprox
14:55:43<@rewby>You just use it as a http-proxy
14:55:48<@rewby>And it'll record warc files of whatever you visit
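
The proxy approach rewby describes can be sketched from the crawler's side: the custom crawler keeps its own logic and simply sends its traffic through a locally running warcprox, which does the recording. This is a minimal sketch using only the Python standard library; `make_proxied_opener` and `fetch` are hypothetical helper names, and the `localhost:8000` address is an assumption about where the proxy listens, so adjust it to your setup. For HTTPS, the proxy intercepts TLS, so the client would also need to trust the proxy's CA certificate; that part is left out here.

```python
import urllib.request


def make_proxied_opener(proxy_host: str = "localhost", proxy_port: int = 8000):
    """Build a urllib opener that sends HTTP and HTTPS requests via a proxy.

    The host and port defaults are assumptions about a local recording
    proxy; they are not read from any warcprox configuration.
    """
    proxy_url = f"http://{proxy_host}:{proxy_port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(opener, url: str, timeout: float = 30.0) -> bytes:
    """Fetch one URL through the proxied opener and return the body bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Requires a recording proxy to actually be running at this address.
    opener = make_proxied_opener()
    body = fetch(opener, "http://example.com/")
    print(len(body), "bytes fetched through the proxy")
```

Since the recording happens in the proxy, the crawler's WAF workarounds (cookies, headers, pacing) stay in the crawler's own code.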
14:59:50<@rewby>Also, just grabbing the html is often not sufficient for a useful archive. There are lots of other assets that need to be fetched to recreate a site.
15:28:42qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
15:36:47<march_happy>rewby: I know that, but the current situation is quite awkward. I am trying to archive deleted posts inside a Baidu Tieba group (like a subreddit)
15:37:51<march_happy>And the only way to get those posts deleted by a malicious admin is to... recover them...
15:38:24<march_happy>This would mean that they would appear on top, so after the crawl those posts have to be deleted again
15:38:37<march_happy>So that they won't flood the current posts away
15:39:04<march_happy>My original plan was to only crawl raw html and then download assets later
15:39:58<march_happy>But from what you guys have said and those articles, I must download assets on-the-fly
15:40:10<march_happy>Otherwise it won't fit inside a WARC file
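
Whenever the assets are fetched, the crawler first has to know which URLs a page references. The following is a minimal sketch of that step using only the Python standard library; `AssetCollector` and `find_assets` are hypothetical names, and the tag list (images, scripts, stylesheets) is an assumption that will miss assets loaded by JavaScript or referenced from CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and stylesheets in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.assets = []  # ordered, de-duplicated list of absolute URLs

    def _add(self, value):
        if not value:
            return
        url = urljoin(self.base_url, value)  # resolve relative references
        if url not in self.assets:
            self.assets.append(url)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in ("img", "script"):
            self._add(attr_map.get("src"))
        elif tag == "link":
            # Only stylesheet links count as assets here; canonical,
            # alternate, and similar links are skipped.
            rel = (attr_map.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self._add(attr_map.get("href"))


def find_assets(base_url: str, html: str) -> list:
    """Return the asset URLs referenced by one HTML document."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.assets


if __name__ == "__main__":
    page = '<link rel="stylesheet" href="/a.css"><img src="pic.png">'
    for asset_url in find_assets("https://example.com/p/1", page):
        print(asset_url)
```

Each returned URL can then be requested by the same crawler, through the same recording setup, so that pages and their assets end up in one capture.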
15:42:10<march_happy>BTW I also have a plan to archive the current Baidu Tieba
15:43:01<march_happy>Some posts were already deleted in 2019 at the request of government regulators
15:43:13<march_happy>Without any prior warning
15:44:21<march_happy>This is ~~Spartan~~ China
15:46:36tzt (tzt) joins
15:48:30<@rewby>I'm not an expert on how to make the best crawls. There are a few people who *really* know what they're doing around here. But they usually don't wake up for another few hours
15:53:35HackMii_ quits [Ping timeout: 258 seconds]
15:54:48HackMii_ (hacktheplanet) joins
16:09:31HackMii_ quits [Remote host closed the connection]
16:09:53HackMii_ (hacktheplanet) joins
17:28:46qwertyasdfuiopghjkl joins
17:42:04march_happy quits [Ping timeout: 258 seconds]
17:47:35spirit joins
17:50:28lennier2 joins
17:53:11lennier1 quits [Ping timeout: 258 seconds]
17:53:20lennier2 is now known as lennier1
19:47:32spirit quits [Client Quit]
20:16:41BlueMaxima joins
20:28:25BlueMaxima quits [Read error: Connection reset by peer]
20:46:50nepeat quits [Ping timeout: 258 seconds]
20:47:13lun4 quits [Ping timeout: 258 seconds]
20:47:36ave quits [Ping timeout: 258 seconds]
20:47:51igloo22225 quits [Ping timeout: 265 seconds]
20:48:13nepeat (nepeat) joins
20:48:23lun4 (lun4) joins
20:58:40ave (ave) joins
20:59:53qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds]
21:05:42qwertyasdfuiopghjkl joins
21:06:29igloo22225 (igloo22225) joins
21:22:39C4K3 joins
21:27:00sembiance (sembiance) joins
21:30:20march_happy (march_happy) joins
22:00:12maxfan8 quits [Quit: WeeChat 1.9.1]
22:12:26TastyWiener95 quits [Ping timeout: 240 seconds]
22:18:04datechnoman quits [Ping timeout: 258 seconds]
22:38:52Hackerpcs quits [Quit: Hackerpcs]
22:40:30Hackerpcs (Hackerpcs) joins
22:55:34Arcorann (Arcorann) joins
23:18:39Stiletto quits [Remote host closed the connection]
23:33:58march_happy quits [Ping timeout: 258 seconds]