| 00:05:26 | | HackMii_ quits [Remote host closed the connection] |
| 00:11:45 | | HackMii_ (hacktheplanet) joins |
| 00:45:21 | | etnguyen03 (etnguyen03) joins |
| 00:55:49 | | sec^nd quits [Ping timeout: 258 seconds] |
| 00:58:10 | | Ruthalas quits [Quit: END OF LINE] |
| 00:58:30 | | Ruthalas (Ruthalas) joins |
| 01:00:02 | | dm4v quits [Client Quit] |
| 01:01:02 | | sec^nd (second) joins |
| 01:01:36 | | dm4v joins |
| 01:01:38 | | dm4v is now authenticated as dm4v |
| 01:01:38 | | dm4v quits [Changing host] |
| 01:01:38 | | dm4v (dm4v) joins |
| 01:02:37 | | K4 joins |
| 01:05:49 | | K4 quits [Client Quit] |
| 01:19:04 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 01:19:43 | | VerifiedJ (VerifiedJ) joins |
| 01:28:26 | | pabs quits [Remote host closed the connection] |
| 01:31:54 | | pabs (pabs) joins |
| 01:57:32 | | Webuser384 quits [Ping timeout: 244 seconds] |
| 02:02:54 | | dm4v quits [Ping timeout: 258 seconds] |
| 02:05:16 | | fangfufu joins |
| 02:07:23 | | dm4v joins |
| 02:07:25 | | dm4v is now authenticated as dm4v |
| 02:07:25 | | dm4v quits [Changing host] |
| 02:07:25 | | dm4v (dm4v) joins |
| 02:13:15 | | katocala quits [Ping timeout: 258 seconds] |
| 02:13:39 | | katocala joins |
| 02:14:53 | | katocala is now authenticated as katocala |
| 02:33:51 | | monoxane4 quits [Client Quit] |
| 02:34:05 | | monoxane4 (monoxane) joins |
| 02:50:49 | | Myself quits [Ping timeout: 258 seconds] |
| 02:55:02 | | etnguyen03 quits [Ping timeout: 258 seconds] |
| 03:00:57 | | etnguyen03 (etnguyen03) joins |
| 03:19:34 | | katocala quits [Ping timeout: 258 seconds] |
| 03:20:37 | | katocala joins |
| 03:22:27 | | katocala is now authenticated as katocala |
| 03:24:49 | | anarcat (anarcat) joins |
| 03:27:49 | | jamesp quits [Client Quit] |
| 04:00:25 | | etnguyen03 quits [Client Quit] |
| 04:00:35 | | etnguyen03 (etnguyen03) joins |
| 04:37:36 | | qw3rty__ joins |
| 04:41:13 | | qw3rty_ quits [Ping timeout: 258 seconds] |
| 05:01:18 | | march_happy (march_happy) joins |
| 05:02:34 | <march_happy> | Long time no see... First thing is that I successfully archived Autodesk Sketchbook UWP version's installer! |
| 05:03:09 | <march_happy> | https://github.com/Aster-the-Med-Stu/Autodesk-Sketchbook-UWP-Backup |
| 05:03:20 | <march_happy> | Together with brushes |
| 05:04:01 | <march_happy> | Later I will upload all the content to archive.org, once I find a proper way to save Autodesk Sketchbook's blog. |
| 05:07:12 | <march_happy> | And I have a question about web archiving. When I was participating in the Google+ archive... How did you guys handle the collected data so that it would fit the Wayback Machine? |
| 05:07:51 | <march_happy> | Asking this because I am trying to archive Baidu Tieba |
| 05:08:54 | <march_happy> | The code is finished and saves crawls to a database, but I have no idea how I should properly save that content on archive.org |
| 05:09:55 | | march_happy quits [Remote host closed the connection] |
| 05:12:16 | | etnguyen03 quits [Ping timeout: 258 seconds] |
| 05:23:15 | | etnguyen03 (etnguyen03) joins |
| 06:08:05 | | etnguyen03 quits [Client Quit] |
| 06:34:26 | | HackMii_ quits [Remote host closed the connection] |
| 06:35:17 | | HackMii_ (hacktheplanet) joins |
| 07:09:57 | | IDK_ quits [Client Quit] |
| 07:10:18 | | IDK_ joins |
| 07:15:48 | | betamax quits [Ping timeout: 244 seconds] |
| 07:16:47 | | betamax (betamax) joins |
| 07:22:37 | | tzt quits [Ping timeout: 265 seconds] |
| 07:29:32 | | Lord_Nightmare quits [Quit: ZNC - http://znc.in] |
| 07:32:20 | | Lord_Nightmare (Lord_Nightmare) joins |
| 08:00:14 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 08:02:13 | | HackMii_ quits [Remote host closed the connection] |
| 08:02:50 | | HackMii_ (hacktheplanet) joins |
| 08:06:20 | | qwertyasdfuiopghjkl joins |
| 08:14:55 | <AK> | TheTechRobo, the isitnormal.com AB job is finished :) |
| 09:23:40 | | HackMii_ quits [Remote host closed the connection] |
| 09:24:32 | | HackMii_ (hacktheplanet) joins |
| 09:24:58 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 09:32:02 | | fangfufu quits [Remote host closed the connection] |
| 10:37:43 | | datechnoman quits [Ping timeout: 258 seconds] |
| 10:59:29 | | fangfufu (fangfufu) joins |
| 11:00:26 | | datechnoman (datechnoman) joins |
| 11:26:56 | | HackMii_ quits [Remote host closed the connection] |
| 11:27:53 | | HackMii_ (hacktheplanet) joins |
| 11:34:11 | | TastyWiener95 (TastyWiener95) joins |
| 11:59:24 | | G4te_Keep3r quits [Remote host closed the connection] |
| 12:00:06 | | G4te_Keep3r joins |
| 13:29:04 | | Arcorann quits [Ping timeout: 258 seconds] |
| 13:37:07 | | Mateon1 quits [Ping timeout: 258 seconds] |
| 13:44:06 | | Myself (myself) joins |
| 13:49:54 | | Mateon1 joins |
| 13:55:54 | | Mateon1 quits [Ping timeout: 258 seconds] |
| 14:00:45 | | vukky (Vukky) joins |
| 14:07:51 | | Mateon1 joins |
| 14:21:58 | | qwertyasdfuiopghjkl joins |
| 14:38:05 | | march_happy (march_happy) joins |
| 14:38:28 | | march_happy quits [Client Quit] |
| 14:38:42 | | march_happy (march_happy) joins |
| 14:43:38 | <march_happy> | I guess it's time to ask again in case you were sleeping. How should I export crawled HTML saved in a database to something that is acceptable for the Wayback Machine? |
| 14:44:14 | <march_happy> | I have no experience with WARC, and so far I have only crawled raw HTML. |
| 14:45:24 | <@rewby> | While I personally do not have the knowledge to help you, I'd advise you to stick around longer. Last time you left almost immediately after you asked your question and didn't give anyone the time to answer. |
| 14:46:07 | <march_happy> | Meh network outage and I forgot to set timeout for reconnect |
| 14:46:21 | <@rewby> | We can't help you if you're not connected. |
| 14:47:13 | <march_happy> | I know there's a log on kiska's server :) |
| 14:47:32 | <@rewby> | Generally people don't answer questions if the asker has left. |
| 14:49:58 | <anarcat> | march_happy: what i know is here https://anarc.at/services/archive/web/ |
| 14:50:50 | <anarcat> | i've also used https://github.com/ludios/grab-site/ |
| 14:50:56 | <anarcat> | for larger, multi-site jobs |
| 14:52:23 | <thuban> | march_happy: the wayback machine works with warcs, not (metadata-less) html, and generally doesn't accept data from third parties due to concerns about data integrity. |
| 14:52:31 | <thuban> | however, the internet archive has an uploading guide here: https://help.archive.org/hc/en-us/articles/360002360111-Uploading-A-Basic-Guide |
| 14:52:37 | <@rewby> | Also, reading back on your questions, I believe the wayback machine only accepts archives in WARC format and that it doesn't just ingest anyone's warcs, you need to get approved by the IA. There's no good way of just getting html into warcs since you'd have to fake the metadata |
| 14:52:44 | <@rewby> | What thuban said, basically |
| 14:52:44 | <thuban> | and would be very happy to have your data in whatever format you find convenient :) |
| 14:54:11 | <@rewby> | Also, when archiving we use software that natively archives to warc files. |
| 14:54:19 | <@rewby> | grab-site is easy to use, but there's other options too |
| 14:54:38 | <march_happy> | But there's WAF |
| 14:54:46 | <march_happy> | So that's why custom crawler is needed |
| 14:55:14 | <@rewby> | You can still emit warc files with a customized crawler (that's how all of our warrior projects work) |
| 14:55:37 | <@rewby> | One way of doing it is with warcprox |
| 14:55:43 | <@rewby> | You just use it as an HTTP proxy |
| 14:55:48 | <@rewby> | And it'll record warc files of whatever you visit |
| 14:59:50 | <@rewby> | Also, just grabbing the html is often not sufficient for a useful archive. There are lots of other assets that need to be fetched to recreate a site. |
| 15:28:42 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 15:36:47 | <march_happy> | rewby: I know that, but the current situation is quite awkward. I am trying to archive deleted posts inside a Baidu Tieba group (like a subreddit) |
| 15:37:51 | <march_happy> | And the only way to get those posts deleted by malicious admin is to... Recover them... |
| 15:38:24 | <march_happy> | This would mean that they would appear on top, so after the crawl those posts have to be deleted again |
| 15:38:37 | <march_happy> | So that it won't flood current posts away |
| 15:39:04 | <march_happy> | My original plan was to crawl only the raw HTML and then download assets later |
| 15:39:58 | <march_happy> | But from what you guys have said and those articles, I must download assets on-the-fly |
| 15:40:10 | <march_happy> | Otherwise it won't fit inside a WARC file |
| 15:42:10 | <march_happy> | BTW I also have a plan to archive current Baidu Tieba |
| 15:43:01 | <march_happy> | Some posts were already deleted in 2019 in response to government regulatory requests |
| 15:43:13 | <march_happy> | Without any prior signs |
| 15:44:21 | <march_happy> | This is ~~Spartan~~ China |
| 15:46:36 | | tzt (tzt) joins |
| 15:48:30 | <@rewby> | I'm not an expert on how to make the best crawls. There's a few people who *really* know what they're doing around here. But they usually don't wake up for another few hours |
| 15:53:35 | | HackMii_ quits [Ping timeout: 258 seconds] |
| 15:54:48 | | HackMii_ (hacktheplanet) joins |
| 16:09:31 | | HackMii_ quits [Remote host closed the connection] |
| 16:09:53 | | HackMii_ (hacktheplanet) joins |
| 17:28:46 | | qwertyasdfuiopghjkl joins |
| 17:42:04 | | march_happy quits [Ping timeout: 258 seconds] |
| 17:47:35 | | spirit joins |
| 17:50:28 | | lennier2 joins |
| 17:53:11 | | lennier1 quits [Ping timeout: 258 seconds] |
| 17:53:20 | | lennier2 is now known as lennier1 |
| 19:47:32 | | spirit quits [Client Quit] |
| 20:16:41 | | BlueMaxima joins |
| 20:28:25 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 20:46:50 | | nepeat quits [Ping timeout: 258 seconds] |
| 20:47:13 | | lun4 quits [Ping timeout: 258 seconds] |
| 20:47:36 | | ave quits [Ping timeout: 258 seconds] |
| 20:47:51 | | igloo22225 quits [Ping timeout: 265 seconds] |
| 20:48:13 | | nepeat (nepeat) joins |
| 20:48:23 | | lun4 (lun4) joins |
| 20:58:40 | | ave (ave) joins |
| 20:59:53 | | qwertyasdfuiopghjkl quits [Ping timeout: 244 seconds] |
| 21:05:42 | | qwertyasdfuiopghjkl joins |
| 21:06:29 | | igloo22225 (igloo22225) joins |
| 21:22:39 | | C4K3 joins |
| 21:22:39 | | C4K3 is now authenticated as C4K3 |
| 21:27:00 | | sembiance (sembiance) joins |
| 21:30:20 | | march_happy (march_happy) joins |
| 22:00:12 | | maxfan8 quits [Quit: WeeChat 1.9.1] |
| 22:12:26 | | TastyWiener95 quits [Ping timeout: 240 seconds] |
| 22:18:04 | | datechnoman quits [Ping timeout: 258 seconds] |
| 22:38:52 | | Hackerpcs quits [Quit: Hackerpcs] |
| 22:40:30 | | Hackerpcs (Hackerpcs) joins |
| 22:55:34 | | Arcorann (Arcorann) joins |
| 23:18:39 | | Stiletto quits [Remote host closed the connection] |
| 23:33:58 | | march_happy quits [Ping timeout: 258 seconds] |
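
The warcprox approach rewby describes at 14:55 (point a custom crawler at it as an HTTP proxy, and it records WARC files of whatever the crawler visits) can be sketched roughly as below. This is only an illustration using Python's standard library, not anyone's actual crawler. The host and port are assumptions about how warcprox was started, and the helper names `proxy_map`, `build_recording_opener`, and `fetch_via_proxy` are hypothetical. For HTTPS pages the crawler would most likely also need to trust the proxy's own CA certificate, which is not shown here.

```python
import urllib.request

# Assumption: a warcprox instance is already running and listening here.
# Adjust both values to match however the proxy was actually started.
WARCPROX_HOST = "localhost"
WARCPROX_PORT = 8000


def proxy_map(host=WARCPROX_HOST, port=WARCPROX_PORT):
    """Return a scheme -> proxy URL mapping that sends all traffic to one proxy."""
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def build_recording_opener(host=WARCPROX_HOST, port=WARCPROX_PORT):
    """Build a urllib opener whose requests all pass through the recording proxy."""
    handler = urllib.request.ProxyHandler(proxy_map(host, port))
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, opener, timeout=30):
    """Fetch one URL through the proxy and return the response body as bytes.

    The crawler still receives the body (for example, to store in its own
    database); writing the WARC record is left entirely to the proxy.
    """
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

With the proxy running, a crawler would call `build_recording_opener()` once and then `fetch_via_proxy(url, opener)` for each page and each asset it wants recorded. Because the recording happens at the proxy, the request and response headers are captured as they were actually exchanged, which is the metadata that HTML stored in a database lacks.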