00:01:19 | <Pedrosso> | We (the Spore community) don't have a complete list of IDs yet. Archiving them one by one works but is inefficient. We know the ID ranges but not where their "holes" are: 300000299348 - 300001258136; 500001000011 - 500999999998; 501000000016 - 501039850984 |
00:02:45 | <thuban> | how many items are there in your list? |
00:05:07 | <thuban> | (order-of-magnitude estimate is ok) |
00:05:36 | <Pedrosso> | Oh, alright approx 10^7 |
00:07:00 | <Pedrosso> | 2 * 10^7 |
00:08:31 | <Pedrosso> | My question is if there are any tools that ArchiveTeam could/would use for this large list of small files (approx 20kB) |
00:09:08 | <thuban> | yes, there are--both archivebot and the urls project (https://wiki.archiveteam.org/index.php/URLs#How_to_help_if_you_have_lists_of_URLs) could do this. i'm just thinking about which would be more appropriate |
00:11:54 | <Pedrosso> | I see, I see |
00:11:58 | <thuban> | we usually prefer archivebot when retrieving large numbers of urls from a single site, because it offers better feedback and control of request speed (preventing accidental ddosing) |
00:12:43 | | lennier2 joins |
00:13:31 | <Pedrosso> | Hm, then what of speed? |
00:14:26 | <Pedrosso> | The only remaining factors would be AT's willingness to take the URLs on and the speed at which they're saved |
00:16:05 | | lennier2_ quits [Ping timeout: 272 seconds] |
00:16:58 | <thuban> | _but_ them all being small images is otherwise best-case, so it might be all right. or we could break them up into multiple lists. JAA, what do you think? |
00:20:47 | <thuban> | 2e7 * 32K is about 600G, which afaik shouldn't be a problem, and speed is probably a question of what ea can/will tolerate and how long we're willing to take |
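[Editor's note] A rough check of the size estimate above, as a minimal sketch. It assumes decimal units and uses the two per-file sizes mentioned in the channel (about 20 kB from Pedrosso, 32 K from thuban). The function name is illustrative only.

```python
def total_size_gb(num_files: int, bytes_per_file: int) -> float:
    """Return the total size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_files * bytes_per_file / 1e9


NUM_FILES = 2 * 10**7  # Pedrosso's order-of-magnitude estimate

# Upper figure used by thuban (32 K per file, taken here as 32,000 bytes).
print(total_size_gb(NUM_FILES, 32_000))  # 640.0
# Lower figure from Pedrosso's "approx 20kB".
print(total_size_gb(NUM_FILES, 20_000))  # 400.0
```

Either way the total stays well under a terabyte, which is consistent with "shouldn't be a problem".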
00:21:35 | <thuban> | ...whoops, first response should have been: |
00:21:39 | <thuban> | well no, the other factor is that archivebot starts having issues if given very large lists, and 2e7 is on the high side. |
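[Editor's note] thuban's suggestion of breaking the list into multiple lists could be sketched roughly as below. This is only an illustration: the URL template and chunk size are hypothetical (the channel never states the real URL pattern or a recommended list size), and `chunked` is a helper written for this note, not an ArchiveBot feature.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(urls: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most chunk_size URLs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(urls)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical URL template; the first ten IDs of the first known range.
ids = range(300000299348, 300000299348 + 10)
urls = (f"https://example.invalid/item/{i}" for i in ids)

parts = list(chunked(urls, 4))
print([len(p) for p in parts])  # [4, 4, 2]
```

Because `chunked` consumes a generator lazily, the full 2e7-entry list never has to sit in memory at once.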
00:28:42 | <Pedrosso> | But you are saying that speed from ArchiveTeam's side is not a problem? |
00:34:40 | <thuban> | from what i recall of previous discussion, i suspect the spore servers would be the limiting factor, yeah. |
00:35:12 | <Pedrosso> | I see, alright. We're still busy getting all the viable links though |
00:38:38 | <thuban> | sounds good. by the way, did you get around to extracting the imgur links from the sporepedia2.foroactivo.com crawl like you mentioned? |
00:40:31 | <flashfire42|m> | I’ll look for some spore telegram groups too. I’ve just done a bunch more coupon ones to complement the crypto stuff. And yes I do throw in good stuff too sometimes |
00:44:07 | | superkuh_ is now known as superkuh |
00:52:10 | <@JAA> | Pedrosso, thuban: 20M is feasible with AB, especially with images. They take little processing time, so they can run fast. It'd probably take on the order of 2 weeks. |
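[Editor's note] JAA's "order of 2 weeks" figure implies an average request rate that can be worked out directly. A minimal sketch, assuming exactly 2e7 URLs and 14 days of uninterrupted running (both are round-number assumptions):

```python
SECONDS_PER_DAY = 86_400


def required_rate(num_urls: int, days: float) -> float:
    """Average URLs per second needed to finish num_urls in the given days."""
    return num_urls / (days * SECONDS_PER_DAY)


rate = required_rate(2 * 10**7, 14)
print(f"{rate:.1f} URLs/s")  # 16.5 URLs/s
```

That is the sustained load the target servers would see, which is why server tolerance comes up as the limiting factor.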
00:53:30 | <@JAA> | As was mentioned, #// doesn't work well for lists of single/few hosts due to the DDoS risk, and we get no real feedback over what happened to the URLs once they go in. It's very much a best-effort shotgun approach at the internet, not useful for targeted crawls. |
00:59:18 | | simon8162 quits [Quit: ZNC 1.8.2 - https://znc.in] |
00:59:37 | | simon816 (simon816) joins |
01:04:05 | | IDK quits [Client Quit] |
01:12:58 | <Pedrosso> | thuban: I got stumped at trying to get to the .warc's. Someone offered to do it and if they won't/can't I'll take it back up |
01:13:23 | <Pedrosso> | okay that was really unclear. I got stumped at trying to use them once downloaded |
01:15:36 | <Pedrosso> | It also appears we are way closer to finishing the list than I thought we were; it appears as though we're halfway but I'll have to confirm that with them |
01:15:52 | <thuban> | Pedrosso: no worries, i've just done it |
01:16:03 | <Pedrosso> | oh, sweet |
01:16:05 | <Pedrosso> | thank you |
01:16:59 | <thuban> | (for future reference, you can handle .warc.gz with anything that handles .gz--zless, zgrep, etc) |
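[Editor's note] The same idea as `zgrep` can be sketched in Python with the standard `gzip` module, which reads multi-member gzip files transparently. The regular expression for imgur links is an assumption made for this note and is not necessarily what thuban used. It is deliberately loose, so it may pick up trailing punctuation.

```python
import gzip
import re
from typing import Iterator

# ASSUMPTION: a loose pattern for imgur page and image URLs.
IMGUR_RE = re.compile(rb"https?://(?:i\.)?imgur\.com/[A-Za-z0-9./_-]+")


def find_imgur_links(path: str) -> Iterator[str]:
    """Yield each distinct imgur URL found in a .warc.gz, in first-seen order."""
    seen = set()
    # gzip.open decompresses on the fly, so the file is scanned line by line.
    with gzip.open(path, "rb") as fh:
        for line in fh:
            for match in IMGUR_RE.finditer(line):
                url = match.group(0).decode("ascii")
                if url not in seen:
                    seen.add(url)
                    yield url
```

Usage would be along the lines of `for url in find_imgur_links("crawl.warc.gz"): print(url)`, where the file name is a placeholder.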
01:18:17 | <Pedrosso> | Nice to say, but honestly, it's just really hard for me to wrap my head around anything that doesn't have an obvious "Click here to download exactly what you want" button and a GUI, haha. |
01:18:40 | <Pedrosso> | I still probably will want to look through .warc.gz:s in the future so, thanks |
01:20:45 | <Pedrosso> | Thanks again :) |
01:23:13 | | ThreeHM quits [Ping timeout: 272 seconds] |
01:23:58 | | ThreeHM (ThreeHeadedMonkey) joins |
01:34:21 | <thuban> | Pedrosso: you're welcome! |
01:36:39 | <thuban> | JAA: thanks! |
01:38:47 | <thuban> | also, idk if you saw the earlier discussion, but is it correct that we don't move third-party uploads to the archiveteam collection anymore and https://wiki.archiveteam.org/index.php/Frequently_Asked_Questions#halp_pls_halp should be updated? |
01:41:24 | <Pedrosso> | Yeah, that was quite confusing |
01:48:47 | <TheTechRobo> | I did an update to at least remove the part where it claims that |
01:49:10 | <TheTechRobo> | It probably deserves some rewording, though |
01:49:21 | <h2ibot> | TheTechRobo edited Frequently Asked Questions (-51, Temporary update to reduce confusion: AT…): https://wiki.archiveteam.org/?diff=51149&oldid=50785 |
01:50:14 | <thuban> | yeah. i also noted when i went to link it just now that the entries aren't proper headings, just bold... there's a bunch of stuff to fix there |
01:53:04 | | ThreeHM quits [Ping timeout: 265 seconds] |
01:53:28 | | ThreeHM (ThreeHeadedMonkey) joins |
01:53:36 | <@JAA> | Yeah, the vast majority of that page is from almost a decade ago. |
02:02:28 | | HP_Archivist (HP_Archivist) joins |
02:22:08 | | Wohlstand quits [Client Quit] |
02:31:38 | | AlsoHP_Archivist joins |
02:31:45 | | HP_Archivist quits [Remote host closed the connection] |
02:34:30 | | AlsoHP_Archivist quits [Client Quit] |
02:34:55 | | simon816 quits [Client Quit] |
02:35:09 | | simon816 (simon816) joins |
02:45:41 | <Pedrosso> | Found a site with a lot of Spanish forums https://www.google.com/search?q=site%3Aforoactivo.com |
02:49:46 | | dumbgoy__ joins |
02:50:03 | | kiryu_ joins |
02:53:47 | | dumbgoy_ quits [Ping timeout: 272 seconds] |
02:53:47 | | kiryu__ quits [Ping timeout: 272 seconds] |
03:14:29 | | kiryu__ joins |
03:14:29 | | kiryu_ quits [Read error: Connection reset by peer] |
03:20:55 | | kiryu__ quits [Client Quit] |
03:21:20 | | kiryu joins |
03:21:20 | | kiryu is now authenticated as kiryu |
03:21:20 | | kiryu quits [Changing host] |
03:21:20 | | kiryu (kiryu) joins |
03:35:12 | | katocala quits [Remote host closed the connection] |
04:36:33 | | DogsRNice_ quits [Read error: Connection reset by peer] |
04:43:12 | | dumbgoy__ quits [Ping timeout: 265 seconds] |
04:46:19 | | okay joins |
04:46:30 | | okay quits [Remote host closed the connection] |
04:47:21 | | BlueMaxima_ quits [Read error: Connection reset by peer] |
04:47:49 | | katocala joins |
04:48:06 | | katocala is now authenticated as katocala |
05:11:56 | | Ruthalas59 quits [Quit: Ping timeout (120 seconds)] |
05:12:13 | | Ruthalas59 (Ruthalas) joins |
05:14:11 | <h2ibot> | Scarlett03 edited Deathwatch (+199, wilko aqquired by CDS Superstores): https://wiki.archiveteam.org/?diff=51150&oldid=51147 |
05:22:37 | | etnguyen03 quits [Ping timeout: 272 seconds] |
05:28:29 | | simon8162 (simon816) joins |
05:31:07 | | Elizabeth (Elizabeth) joins |
05:31:08 | | colona_ joins |
05:31:09 | | Pedrosso quits [Client Quit] |
05:31:09 | | TheTechRobo quits [Client Quit] |
05:31:09 | | JensRex quits [Remote host closed the connection] |
05:31:09 | | h3ndr1k quits [Remote host closed the connection] |
05:31:09 | | wyatt8740 quits [Client Quit] |
05:31:09 | | colona quits [Remote host closed the connection] |
05:31:09 | | aismallard quits [Client Quit] |
05:31:09 | | simon816 quits [Client Quit] |
05:31:09 | | rewby1 (rewby) joins |
05:31:09 | | @ChanServ sets mode: +o rewby1 |
05:31:12 | | @rewby quits [Remote host closed the connection] |
05:31:12 | | Eliz quits [Remote host closed the connection] |
05:31:12 | | Kitty quits [Remote host closed the connection] |
05:31:12 | | ThreeHM quits [Write error: Broken pipe] |
05:31:12 | | Hackerpcs quits [Remote host closed the connection] |
05:31:13 | | aismallard joins |
05:31:16 | | Pedrosso joins |
05:31:17 | | @rewby1 is now known as @rewby |
05:31:18 | | JensRex (JensRex) joins |
05:31:22 | | etnguyen03 (etnguyen03) joins |
05:31:30 | | ThreeHM (ThreeHeadedMonkey) joins |
05:31:43 | | TheTechRobo (TheTechRobo) joins |
05:31:46 | | Hackerpcs (Hackerpcs) joins |
05:31:58 | | Kitty (Kitty) joins |
05:32:51 | | h3ndr1k (h3ndr1k) joins |
05:40:31 | | wyatt8740 joins |
05:48:17 | | etnguyen03 quits [Client Quit] |
05:59:27 | <Ryz> | Yo, regarding https://wiki.archiveteam.org/index.php/Deathwatch - if there's a bunch of entries in https://wiki.archiveteam.org/index.php/Deathwatch#Pining_for_the_Fjords_(Dying) - shouldn't the ones that died already move to https://wiki.archiveteam.org/index.php/Deathwatch#Dead_as_a_Doornail ? |
06:02:06 | <thuban> | yes, it just sometimes takes a while for someone to get around to it (esp when sites don't actually shut down on the announced schedule) |
06:03:08 | <@JAA> | Yeah, I moved a bunch the other day, several of which had been dead for months. |
06:03:23 | <h2ibot> | Ryz edited Deathwatch (+243, /* 2024 */ Add GameBattles): https://wiki.archiveteam.org/?diff=51151&oldid=51150 |
06:11:00 | <Ryz> | Hmm...I'm not sure if ArchiveBot can handle archiving stuff like https://gamebattles.majorleaguegaming.com/tournaments :/ |
06:17:09 | | ScenarioPlanet quits [Client Quit] |
06:52:37 | | Island_ quits [Read error: Connection reset by peer] |
07:03:02 | | benjinsmi quits [Read error: Connection reset by peer] |
07:10:33 | <Flashfire42> | israel just stormed Gaza hospital |
07:29:23 | <Ryz> | Ah yes, killing the website version of Comixology, and then finally killing off the app version so it can merge into Kindle, wow Amazon :/ - https://www.comicsbeat.com/comixology-app-merges-with-the-kindle-app-at-amazon/ |
07:58:10 | | Arcorann (Arcorann) joins |
08:10:42 | | decky_e_ quits [Read error: Connection reset by peer] |
08:12:21 | | Pedrosso quits [Client Quit] |
08:12:21 | | TheTechRobo quits [Client Quit] |
08:12:25 | | Pedrosso joins |
08:12:48 | | TheTechRobo (TheTechRobo) joins |
08:15:01 | | TheTechRobo quits [Excess Flood] |
08:16:37 | | TheTechRobo (TheTechRobo) joins |
08:38:01 | <h2ibot> | JustAnotherArchivist edited Deathwatch (+52, /* 2023 */ Add LARM.fm): https://wiki.archiveteam.org/?diff=51152&oldid=51151 |
09:08:51 | | decky_e joins |
09:55:04 | | DigitalDragons quits [Quit: Leaving] |
10:01:25 | | Bleo1 joins |
10:18:24 | | DigitalDragons (DigitalDragons) joins |
10:30:44 | | Peroniko (Peroniko) joins |
10:52:15 | | Bleo18 joins |
10:52:15 | | thehedgeh0g quits [Ping timeout: 265 seconds] |
10:52:15 | | wyatt8740 quits [Client Quit] |
10:52:15 | | Bleo1 quits [Client Quit] |
10:52:15 | | Bleo18 is now known as Bleo1 |
10:52:15 | | DigitalDragons quits [Client Quit] |
10:52:18 | | thehedgeh0g (mrHedgehog0) joins |
10:53:45 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
10:53:50 | | kiryu_ joins |
10:54:17 | | wyatt8740 joins |
10:56:27 | | qwertyasdfuiopghjkl quits [Excess Flood] |
10:56:27 | | Bleo1 quits [Client Quit] |
10:56:27 | | TheTechRobo quits [Client Quit] |
10:56:27 | | Pedrosso quits [Client Quit] |
10:56:27 | | Peroniko quits [Remote host closed the connection] |
10:56:30 | | Pedrosso joins |
10:56:35 | | Bleo1 joins |
10:56:39 | | Peroniko (Peroniko) joins |
10:57:30 | | TheTechRobo (TheTechRobo) joins |
10:57:30 | | TheTechRobo quits [Excess Flood] |
10:57:30 | | Peroniko quits [Remote host closed the connection] |
10:57:51 | | Peroniko (Peroniko) joins |
10:58:07 | | TheTechRobo (TheTechRobo) joins |
10:58:48 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
10:58:48 | | kiryu quits [Ping timeout: 263 seconds] |
10:58:59 | | TheTechRobo quits [Excess Flood] |
11:15:00 | | benjins joins |
11:16:18 | | benjinsm joins |
11:16:26 | | kiryu__ joins |
11:16:56 | | lennier2_ joins |
11:18:26 | | qwertyasdfuiopghjkl quits [Remote host closed the connection] |
11:19:03 | | kiryu_ quits [Ping timeout: 257 seconds] |
11:19:32 | | Peroniko quits [Ping timeout: 265 seconds] |
11:19:32 | | lennier2 quits [Ping timeout: 265 seconds] |
11:20:51 | | Peroniko (Peroniko) joins |
11:21:28 | | benjins quits [Ping timeout: 265 seconds] |
11:28:14 | | Peroniko quits [Ping timeout: 265 seconds] |
12:39:04 | | icedice (icedice) joins |
12:48:29 | | Arcorann quits [Ping timeout: 272 seconds] |
12:55:13 | | SF quits [Remote host closed the connection] |
12:56:59 | | TheTechRobo (TheTechRobo) joins |
13:02:30 | | bf_ quits [Remote host closed the connection] |
13:12:25 | | Wohlstand (Wohlstand) joins |
13:12:56 | | bf_ joins |
13:13:21 | | bf_ quits [Remote host closed the connection] |
13:13:24 | | bf_ joins |
13:54:01 | | qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins |
13:55:34 | | bf_ quits [Remote host closed the connection] |
13:55:49 | | DigitalDragons (DigitalDragons) joins |
14:33:16 | | etnguyen03 (etnguyen03) joins |
14:45:25 | | Wohlstand quits [Client Quit] |
15:01:29 | | dave3 quits [Ping timeout: 272 seconds] |
15:01:31 | | dumbgoy__ joins |
15:01:52 | | dave3 (dave) joins |
15:06:46 | | BearFortress joins |
15:10:21 | | BearFortress_ quits [Ping timeout: 272 seconds] |
16:05:20 | | Island joins |
16:16:11 | | rohvani joins |
16:44:05 | | kiryu__ quits [Remote host closed the connection] |
16:51:25 | | kiryu joins |
16:51:25 | | kiryu is now authenticated as kiryu |
16:51:25 | | kiryu quits [Changing host] |
16:51:25 | | kiryu (kiryu) joins |
17:22:36 | <tomodachi94> | I would appreciate it if someone would grab "haughey.com". This user posted that their blog would be shut down in 60 days: https://xoxo.zone/@mathowie/111415557908672738 |
17:22:56 | <tomodachi94> | (Can't find any AB jobs: https://archive.fart.website/archivebot/viewer/?q=haughey.com) |
17:24:35 | <tomodachi94> | Unsure if they are going to save the blog or not, but better safe than sorry ig? |
17:26:39 | | DogsRNice joins |
17:40:57 | | atphoenix_ (atphoenix) joins |
17:43:37 | | atphoenix__ quits [Ping timeout: 272 seconds] |
17:55:30 | | eggdrop leaves |
17:56:48 | | eggdrop (eggdrop) joins |
18:00:44 | | lunik173 quits [Quit: :x] |
18:22:42 | <@JAA> | Weird, that isn't even a Blogger blog as far as I can see. Maybe it was in the past. |
18:26:07 | | icedice2 (icedice) joins |
18:26:17 | | Bleo18 joins |
18:26:18 | | Pedrosso quits [Client Quit] |
18:26:18 | | TheTechRobo quits [Client Quit] |
18:26:18 | | Bleo1 quits [Client Quit] |
18:26:18 | | rohvani quits [Client Quit] |
18:26:18 | | icedice quits [Remote host closed the connection] |
18:26:18 | | DogsRNice quits [Remote host closed the connection] |
18:26:19 | | Bleo18 is now known as Bleo1 |
18:26:20 | | Pedrosso joins |
18:26:22 | | rohvani joins |
18:26:23 | | DogsRNice joins |
18:26:42 | | TheTechRobo (TheTechRobo) joins |
18:26:46 | <@JAA> | Running |
18:30:48 | <tomodachi94> | Appreciated |
18:31:02 | <fireonlive> | JAA++ |
18:31:02 | <eggdrop> | [karma] 'JAA' now has 3 karma! |
18:31:09 | <@JAA> | Also #frogger for the upcoming Blogger project |
18:34:24 | | taewoogf joins |
18:35:33 | | etnguyen03 quits [Ping timeout: 272 seconds] |
18:36:21 | | taewoogf quits [Remote host closed the connection] |
18:44:12 | | etnguyen03 (etnguyen03) joins |
19:12:55 | | etnguyen03 quits [Ping timeout: 272 seconds] |
19:29:16 | <Ryz> | arkiver and others, a reminder on not only Blogger stuff, it's also Google Docs and other goodies like Google Photos; rather curious it's not YouTube, though probably a specific reason >;o |
19:37:54 | <vokunal|m> | Frogger might want to keep track of urls to those other services in the blogs and potentially send them to another project as well. Is google drive in the burnpile? might be a good idea to get #googlecrash back online if so |
19:40:42 | <@JAA> | As I understand it, everything associated with the inactive accounts is getting shredded. |
19:42:51 | | CraftByte (DragonSec|CraftByte) joins |
19:46:26 | | vokunal|m uploaded an image: (903KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.org/aNKbhbTkcFxWHGUKjbcuwLIT/image.png > |
19:46:27 | <vokunal|m> | It's so fun watching these things work. Probably a bit inefficient having them display every single line, but it's nice to watch |
19:56:15 | | lunik173 joins |
20:06:51 | | lunik173 quits [Ping timeout: 265 seconds] |
20:28:55 | | le0n_ quits [Ping timeout: 272 seconds] |
20:32:31 | | le0n (le0n) joins |
20:36:57 | | etnguyen03 (etnguyen03) joins |
20:52:02 | | nicolas17 joins |
21:15:54 | <@arkiver> | yeah Ryz |
21:28:07 | | BlueMaxima joins |
21:28:45 | | dumbgoy joins |
21:32:15 | | dumbgoy__ quits [Ping timeout: 272 seconds] |
21:40:33 | | dumbgoy_ joins |
21:43:31 | | dumbgoy quits [Ping timeout: 265 seconds] |
21:45:24 | | Wohlstand (Wohlstand) joins |
21:54:48 | | dumbgoy__ joins |
21:57:32 | | dumbgoy_ quits [Ping timeout: 265 seconds] |
22:00:59 | <Pedrosso> | I've gathered a fairly extensive but not complete list of old and dying or thriving but niche Spore-related forums https://transfer.archivete.am/J2GVQ/sporeforums1.txt |
22:29:06 | | Peroniko (Peroniko) joins |
22:29:07 | | Peroniko quits [Max SendQ exceeded] |
22:29:28 | | Peroniko (Peroniko) joins |
22:29:29 | | Peroniko quits [Max SendQ exceeded] |
22:29:55 | | Peroniko (Peroniko) joins |
22:31:50 | | Peroniko quits [Client Quit] |
22:32:07 | | Peroniko (Peroniko) joins |
22:32:08 | | Peroniko quits [Max SendQ exceeded] |
22:32:34 | | Peroniko (Peroniko) joins |
22:51:23 | <Peroniko> | Copied from #archiveteam-ot: I want to archive a few hundred historical documents from the local library (books, newspapers...). The problem is that they can't be archived using the Wayback Machine because each image is loaded using JavaScript and the links to those images aren't loaded in a way that IA can capture them. The names of the images are available in the source code of each book (for example: https://www.old.dlib.me/sken_prikaz_1_f.php?id_jedinice=1034), which shows that the images are loaded from the lista_skenova section and that they exist in the skenovi/nj-gorski-vijenac-engleski folder under the base URL. The folder name is different for each document. Has anyone else encountered this type of library preview? I think I've seen it before. I would also like to convert all of those books to PDF and upload them to IA separately. I've begun downloading this manually using some basic scripts and wget, but there are about 1500 pages of this and it would be too labor-intensive to continue like that. |
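[Editor's note] The approach Peroniko describes (read the image names from the lista_skenova section, then fetch them from the document's folder under skenovi/) could be sketched as follows. The exact markup of lista_skenova is not given in the channel, so the parsing is an assumption: quoted image file names following that marker. The sample string is made up and is not real page source, and the function is not taken from Peroniko's script.

```python
import re
from typing import List
from urllib.parse import urljoin

BASE_URL = "https://www.old.dlib.me/"


def image_urls(page_source: str, folder: str) -> List[str]:
    """Build image URLs from the names listed after the lista_skenova marker."""
    # ASSUMPTION: file names appear as quoted strings after "lista_skenova".
    _, found, tail = page_source.partition("lista_skenova")
    if not found:
        return []
    names = re.findall(
        r"""["']([^"'/]+\.(?:jpe?g|png))["']""", tail, flags=re.IGNORECASE
    )
    return [urljoin(BASE_URL, f"skenovi/{folder}/{name}") for name in names]


# Made-up sample, not real page source.
sample = 'var lista_skenova = ["001.jpg", "002.jpg"];'
for url in image_urls(sample, "nj-gorski-vijenac-engleski"):
    print(url)
```

The resulting list could then be handed to wget or submitted for archiving. The folder name has to be supplied per document, since it differs for each one.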
23:03:27 | | Matthww119 quits [Ping timeout: 272 seconds] |
23:05:59 | <thuban> | Peroniko: interesting, i will take a look at this and get back to you in a bit. are the available documents just the ones under the https://www.old.dlib.me/petarpetrovic2njegos/ collection, or is there more? |
23:06:30 | <Peroniko> | There are others here: https://www.old.dlib.me/ |
23:06:51 | <Peroniko> | books, manuscripts, photos, maps... |
23:09:07 | | DogsRNice_ joins |
23:09:31 | <thuban> | oh, my mistake! i saw those but didn't see that they were browsable |
23:09:31 | <thuban> | (the link isn't clearly indicated and the "english" site mostly isn't...) |
23:09:31 | | rohvani quits [Client Quit] |
23:09:31 | | DogsRNice quits [Remote host closed the connection] |
23:09:31 | | Peroniko quits [Remote host closed the connection] |
23:09:31 | | TheTechRobo quits [Client Quit] |
23:09:31 | | Pedrosso quits [Client Quit] |
23:09:31 | | CraftByte quits [Client Quit] |
23:09:33 | | rohvani joins |
23:09:35 | | Pedrosso joins |
23:09:36 | | CraftByte (DragonSec|CraftByte) joins |
23:09:40 | | Peroniko (Peroniko) joins |
23:09:49 | | lunik173 joins |
23:09:57 | | TheTechRobo (TheTechRobo) joins |
23:13:23 | <thuban> | i think it should be possible to get the documents to work in the wayback machine |
23:20:33 | | etnguyen03 quits [Ping timeout: 272 seconds] |
23:25:48 | <Peroniko> | I've made this script to download. Seems to work but not yet fully tested. https://gist.github.com/Fooftilly/52793337319782576ad57fc01cbbb312 |
23:28:49 | | etnguyen03 (etnguyen03) joins |
23:31:09 | <thuban> | bad ids don't result in 404s, unfortunately |