00:04:23 | | archiveDrill quits [Quit: The Lounge - https://thelounge.chat] |
00:15:17 | | dabs quits [Read error: Connection reset by peer] |
00:18:33 | <Flashfire42> | https://itch.io/t/5149036/reindexing-adult-nsfw-content |
00:19:13 | <@JAA> | → #scratchtheitch (already posted there) |
00:31:52 | | dabs joins |
00:37:47 | | Guest58 joins |
00:40:33 | | hamouda joins |
00:45:37 | | Wohlstand quits [Quit: Wohlstand] |
00:47:12 | <hamouda> | Hi everyone, I'm back after archive.org was hacked. I'm the OP of this post on Reddit: https://www.reddit.com/r/Archiveteam/comments/1gdszot/archiving_archives_of_highly_important_lost_forums/ . You told me to wait until the website had fully recovered before these archives could be scraped. I want them as WARC 1.1 so I can convert them to ZIM. |
00:47:12 | <hamouda> | any ideas? thank you for giving me this opportunity. |
00:50:53 | <pokechu22> | Are you specifically interested in just the 5 pages linked in that reddit post, or the entirety of https://al-maktaba.org/ (which I'm not sure how to do, since the main page redirects) |
00:53:04 | <pokechu22> | ah, each "book" has multiple pages corresponding to individual threads, hmm |
00:53:21 | <hamouda> | they are not pages, yes. |
00:54:58 | <hamouda> | They are related to the shamela.ws website. It's one of the best libraries on the web. |
00:58:01 | <pokechu22> | I've started an archivebot job: http://archivebot.com/?initialFilter=ahlalhdeeth.com - WARCs will appear at https://archive.fart.website/archivebot/viewer/job/13g82 when it finishes |
00:58:12 | | Webuser903083 joins |
01:00:08 | <hamouda> | thank you so much. will this be WARC 1.1 or 1.0? |
01:02:54 | <pokechu22> | ArchiveBot generates WARC 1.0. I'm not sure what actually changed with WARC 1.1; https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/ seems to have some new features but they probably aren't required |
01:03:32 | <pokechu22> | You could try https://archive.fart.website/archivebot/viewer/job/20240103195343cadvi to see if it works as expected |
01:06:12 | | Webuser220582 joins |
01:07:38 | <hamouda> | OK, will this crawl get the pages or just the titles of the book? See for example this title: https://al-maktaba.org/book/31621/5 and its pages are https://al-maktaba.org/book/31621/6#p1 https://al-maktaba.org/book/31621/7#p1 https://al-maktaba.org/book/31621/8#p1 etc. |
01:08:10 | | Webuser903083 quits [Client Quit] |
01:09:53 | <pokechu22> | Yes, it should get those |
01:10:53 | <hamouda> | that's great. thank you. |
01:11:14 | <@JAA> | Re WARC 1.0 vs 1.1, see the two diff* files here: https://github.com/JustAnotherArchivist/warc-specifications/tree/comparison-1.0-v-1.1/specifications/warc-format/warc-1.1 |
01:21:57 | <hamouda> | Good. I asked about WARC 1.1 because I'm using warc2zim to convert WARC 1.1 files and play them in the Kiwix app. I haven't tried converting from WARC 1.0 yet; I'll test whether it works normally, but I think it should. |
01:27:42 | | etnguyen03 (etnguyen03) joins |
01:31:42 | | nicolas17_ joins |
01:32:59 | | nicolas17 quits [Ping timeout: 260 seconds] |
01:37:29 | | etnguyen03 quits [Client Quit] |
01:46:53 | | etnguyen03 (etnguyen03) joins |
01:47:08 | | nicolas17 joins |
01:47:34 | | nicolas17_ quits [Ping timeout: 260 seconds] |
02:28:03 | | Guest58 quits [Client Quit] |
02:29:16 | | etnguyen03 quits [Remote host closed the connection] |
02:32:19 | | dabs quits [Client Quit] |
03:03:00 | | kansei quits [Quit: ZNC 1.10.1 - https://znc.in] |
03:08:17 | <h2ibot> | TriangleDemon edited YouTube (+48, UPDATE): https://wiki.archiveteam.org/?diff=56659&oldid=56658 |
03:10:10 | | kansei (kansei) joins |
03:14:18 | | Guest58 joins |
03:16:04 | | Hackerpcs quits [Quit: Hackerpcs] |
03:23:22 | | Hackerpcs (Hackerpcs) joins |
03:28:02 | | nicolas17_ joins |
03:28:29 | | nicolas17 quits [Ping timeout: 260 seconds] |
03:29:38 | | Guest58 quits [Read error: Connection reset by peer] |
03:30:05 | | Guest58 joins |
03:41:22 | | nicolas17_ is now known as nicolas17 |
03:41:23 | | nicolas17 is now authenticated as nicolas17 |
03:47:13 | | Guest58_ joins |
03:48:24 | | Guest58_ quits [Client Quit] |
03:50:39 | | Guest58 quits [Ping timeout: 260 seconds] |
04:18:52 | | Guest58 joins |
04:21:36 | | devkev0 (devkev) joins |
04:23:19 | | ATinySpaceMarine quits [Ping timeout: 260 seconds] |
04:23:19 | | devkev quits [Ping timeout: 260 seconds] |
04:23:19 | | devkev0 is now known as devkev |
04:24:43 | | ATinySpaceMarine joins |
04:29:53 | | evergreen quits [Quit: Ping timeout (120 seconds)] |
04:30:02 | | evergreen joins |
04:30:19 | | khaoohs quits [Ping timeout: 260 seconds] |
04:30:34 | | lennier2_ quits [Ping timeout: 240 seconds] |
04:44:50 | | GradientCat quits [Quit: Connection closed for inactivity] |
04:47:34 | | midou quits [Ping timeout: 240 seconds] |
04:55:03 | | midou joins |
05:01:59 | | hamouda quits [Quit: Ooops, wrong browser tab.] |
05:07:17 | | Guest58 quits [Read error: Connection reset by peer] |
05:18:50 | | i_have_n0_idea37 quits [Quit: The Lounge - https://thelounge.chat] |
05:20:49 | | i_have_n0_idea37 (i_have_n0_idea) joins |
05:21:05 | | nicolas17_ joins |
05:23:24 | | nicolas17 quits [Ping timeout: 260 seconds] |
05:23:50 | | khaoohs joins |
05:23:53 | | Webuser220582 quits [Quit: Ooops, wrong browser tab.] |
05:38:55 | | khaoohs quits [Remote host closed the connection] |
05:39:12 | | khaoohs joins |
05:40:13 | | Guest58 joins |
05:58:18 | | feed quits [Quit: Limnoria 2024.12.20] |
05:58:28 | | feed (feed) joins |
06:11:47 | | nicolas17_ is now known as nicolas17 |
06:13:26 | | PredatorIWD25 joins |
06:14:32 | | Island quits [Read error: Connection reset by peer] |
06:33:25 | | awauwa (awauwa) joins |
06:47:49 | | lennier2_ joins |
07:01:00 | | Guest58 quits [Client Quit] |
07:04:58 | | Webuser580573 joins |
07:10:42 | | Webuser580573 quits [Client Quit] |
07:18:54 | | Sokar quits [Read error: Connection reset by peer] |
07:20:51 | | Sokar joins |
07:33:43 | | Guest58 joins |
08:43:12 | <h2ibot> | TriangleDemon edited YouTube (+99, Add YouTube crawls): https://wiki.archiveteam.org/?diff=56660&oldid=56659 |
08:45:12 | <h2ibot> | TriangleDemon edited Yahoo! Video (+87, Add data crawls): https://wiki.archiveteam.org/?diff=56661&oldid=47342 |
08:47:00 | | Dada joins |
08:47:12 | <h2ibot> | TriangleDemon edited TikTok (+38, Add data crawls): https://wiki.archiveteam.org/?diff=56662&oldid=56095 |
08:49:13 | <h2ibot> | TriangleDemon edited Rumble (+213): https://wiki.archiveteam.org/?diff=56663&oldid=51579 |
08:50:13 | <h2ibot> | TriangleDemon uploaded File:Rumble.png: https://wiki.archiveteam.org/?title=File%3ARumble.png |
08:50:14 | <h2ibot> | TriangleDemon uploaded File:Rumble homepage.png: https://wiki.archiveteam.org/?title=File%3ARumble%20homepage.png |
08:52:13 | <h2ibot> | TriangleDemon edited Rumble (+87): https://wiki.archiveteam.org/?diff=56666&oldid=56663 |
09:26:04 | | celestial quits [Ping timeout: 260 seconds] |
09:27:05 | | celestial joins |
09:36:34 | | TheEnbyperor_ quits [Ping timeout: 260 seconds] |
09:36:44 | | TheEnbyperor_ (TheEnbyperor) joins |
09:37:34 | | TheEnbyperor quits [Ping timeout: 240 seconds] |
09:38:41 | | TheEnbyperor_ is now known as TheEnbyperor |
09:38:43 | | TheEnbyperor_ joins |
09:54:57 | | sec^nd quits [Remote host closed the connection] |
09:58:35 | | sec^nd (second) joins |
09:59:20 | | BornOn420 quits [Remote host closed the connection] |
09:59:58 | | BornOn420 (BornOn420) joins |
10:02:36 | | LunarianBunny1147 (LunarianBunny1147) joins |
10:04:34 | | Lunarian1 quits [Ping timeout: 260 seconds] |
11:00:03 | | Bleo182600722719623455222 quits [Quit: The Lounge - https://thelounge.chat] |
11:02:50 | | Bleo182600722719623455222 joins |
11:16:28 | | egallager joins |
12:24:28 | | Deksor joins |
12:25:09 | | Snivy quits [Ping timeout: 260 seconds] |
12:25:27 | <Deksor> | Hello, I just realized that anandtech.com is now fully gone (only the forum is online) |
12:25:27 | <Deksor> | Is the ETA of the archive up to date? https://wiki.archiveteam.org/index.php/AnandTech :( |
12:27:11 | <Deksor> | Also, where could I download such an archive? |
12:47:44 | <that_lurker> | It should be in the Wayback Machine. Do you want a local copy? |
12:58:36 | <Deksor> | Yes. |
13:08:44 | | egallager quits [Read error: Connection reset by peer] |
13:08:56 | | egallager joins |
13:35:12 | | ericgallager joins |
13:36:54 | | egallager quits [Ping timeout: 260 seconds] |
14:23:18 | <that_lurker> | Deksor: https://archive.fart.website/archivebot/viewer/job/20240901213047bvqa8 this is most likely the latest copy |
14:29:09 | <Deksor> | thanks ! |
14:37:08 | | PredatorIWD25 quits [Read error: Connection reset by peer] |
14:40:27 | | PredatorIWD25 joins |
14:41:00 | | IDK (IDK) joins |
14:49:49 | | midou quits [Ping timeout: 260 seconds] |
15:00:07 | | midou joins |
15:06:37 | | Barto quits [Quit: WeeChat 4.7.0] |
15:06:56 | | Webuser623865 joins |
15:10:08 | | Webuser623865 quits [Client Quit] |
15:11:03 | | Barto (Barto) joins |
16:05:15 | | Island joins |
16:29:04 | | awauwa quits [Quit: awauwa] |
16:33:26 | | archiveDrill joins |
16:35:07 | | GradientCat (GradientCat) joins |
16:49:55 | | BennyOtt quits [Quit: ZNC 1.10.1 - https://znc.in] |
16:50:56 | | BennyOtt (BennyOtt) joins |
16:54:13 | | BennyOtt quits [Client Quit] |
16:56:03 | | BennyOtt (BennyOtt) joins |
17:23:49 | | nicolas17 quits [Ping timeout: 260 seconds] |
17:26:16 | | nicolas17 joins |
17:38:53 | | UwU joins |
17:44:53 | <mgrandi> | https://cpb.org/pressroom/Corporation-Public-Broadcasting-Addresses-Operations-Following-Loss-Federal-Funding , we should probably throw the site into ArchiveBot |
17:47:11 | <pokechu22> | https://archive.fart.website/archivebot/viewer/job/20250503221447lj30p - looks like that took 16 hours when we did it a few months ago |
17:47:42 | <pokechu22> | will run it in a bit once the election stuff finishes |
17:58:54 | | HP_Archivist (HP_Archivist) joins |
18:17:20 | <yano> | https://blog.google/technology/developers/googl-link-shortening-update/ |
18:18:01 | | justaguy is now known as mystique_altrosky |
18:24:54 | <Jens> | Many such cases. |
18:25:57 | <nulldata> | https://www.cnn.com/2025/08/01/media/trump-cpb-corporation-public-media-shuts-down |
18:26:25 | <@JAA> | → #UncleSamsArchive and #urlteamwasright respectively |
19:15:58 | <h2ibot> | Anonymoususer852 edited Talk:Tracker (+2, Typo "In" → "Done".): https://wiki.archiveteam.org/?diff=56667&oldid=56640 |
19:15:59 | <h2ibot> | Anonymoususer852 edited Talk:Tracker (+13, Place references before my signature.): https://wiki.archiveteam.org/?diff=56668&oldid=56667 |
19:22:16 | | Barto quits [Quit: WeeChat 4.7.0] |
19:24:19 | | Barto (Barto) joins |
19:26:31 | | archiveDrill4 joins |
19:27:14 | | archiveDrill quits [Ping timeout: 240 seconds] |
19:27:14 | | archiveDrill4 is now known as archiveDrill |
19:37:34 | <hexagonwin> | When using grab-site, is it OK to have a very large ignore list (2.7M)? I scraped a website but it's missing quite a few pages, so I'm trying to make it scrape everything except what I already have. |
19:44:58 | | nicolas17 is now authenticated as nicolas17 |
19:48:13 | | CYBERDEV joins |
19:52:39 | <pokechu22> | That's probably not going to perform well |
19:53:07 | <pokechu22> | it might make more sense to somehow add those specific URLs to the database as done, but I don't know exactly how to go about doing that |
19:53:36 | | IDK quits [Quit: Connection closed for inactivity] |
19:56:21 | | cuphead2527480 (Cuphead2527480) joins |
20:29:38 | | FiTheArchiver joins |
20:32:14 | | anonymoususer852 quits [Ping timeout: 260 seconds] |
20:59:14 | | anonymoususer852 (anonymoususer852) joins |
21:03:51 | | etnguyen03 (etnguyen03) joins |
21:14:25 | | GradientCat quits [Quit: Connection closed for inactivity] |
21:21:55 | | Larsenv quits [Quit: The Lounge - https://thelounge.chat] |
21:26:27 | | etnguyen03 quits [Client Quit] |
21:26:48 | | etnguyen03 (etnguyen03) joins |
21:35:21 | <@JAA> | grab-site's ignores are a bit different than AB's, and it might scale a bit better, but 2.7 million sounds like it'll be slow regardless. |
21:36:36 | | etnguyen03 quits [Client Quit] |
21:37:15 | | etnguyen03 (etnguyen03) joins |
21:41:46 | | Larsenv (Larsenv) joins |
21:44:24 | <h2ibot> | TriangleDemon edited GeoCities (+32, Add data crawl): https://wiki.archiveteam.org/?diff=56669&oldid=55244 |
21:45:24 | <h2ibot> | TriangleDemon edited Sketch (+46, Add data crawl): https://wiki.archiveteam.org/?diff=56670&oldid=49246 |
21:46:32 | | Larsenv quits [Remote host closed the connection] |
21:47:25 | <h2ibot> | TriangleDemon edited Karayou.com (+0, update): https://wiki.archiveteam.org/?diff=56671&oldid=50362 |
21:49:09 | | Larsenv (Larsenv) joins |
21:49:25 | <h2ibot> | TriangleDemon edited Colors! (-1): https://wiki.archiveteam.org/?diff=56672&oldid=56597 |
22:08:56 | | etnguyen03 quits [Remote host closed the connection] |
22:10:11 | | etnguyen03 (etnguyen03) joins |
22:16:42 | | Yakov joins |
22:18:04 | | SootBector quits [Remote host closed the connection] |
22:19:16 | | SootBector (SootBector) joins |
22:29:00 | | Yakov is now authenticated as Yakov |
22:29:00 | | Yakov quits [Changing host] |
22:29:00 | | Yakov (Yakov) joins |
22:36:01 | | cuphead2527480 quits [Client Quit] |
22:37:41 | | lunik1 quits [Quit: :x] |
22:38:13 | | lunik1 joins |
22:39:30 | | Dada quits [Remote host closed the connection] |
22:45:42 | | Wohlstand (Wohlstand) joins |
23:01:49 | | nicolas17_ joins |
23:03:54 | | nicolas17 quits [Ping timeout: 260 seconds] |
23:04:23 | | Hackerpcs quits [Quit: Hackerpcs] |
23:15:15 | | FiTheArchiver quits [Read error: Connection reset by peer] |
23:20:19 | | nicolas17_ is now known as nicolas17 |
23:27:28 | | cuphead2527480 (Cuphead2527480) joins |
23:44:54 | | UwU quits [Ping timeout: 240 seconds] |
23:47:00 | | UwU- joins |
23:47:28 | | etnguyen03 quits [Client Quit] |
23:50:19 | | hackbug quits [Remote host closed the connection] |
23:50:52 | | UwU- quits [Client Quit] |
23:53:04 | | hackbug (hackbug) joins |
23:54:39 | | UwU joins |
23:58:44 | | UwU quits [Client Quit] |
23:59:22 | | UwU joins |