00:01:19<Pedrosso>We (the spore community) haven't gotten a complete list of IDs yet. Archiving them one by one works but is inefficient. We know the ID ranges but not where their "holes" are. 300000299348 - 300001258136; 500001000011 - 500999999998; 501000000016 - 501039850984;
00:02:45<thuban>how many items are there in your list?
00:05:07<thuban>(order-of-magnitude estimate is ok)
00:05:36<Pedrosso>Oh, alright approx 10^7
00:07:00<Pedrosso>2 * 10^7
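For scale, a rough sketch of how sparse those ranges are. It assumes the bounds quoted above are inclusive and that the roughly 2 * 10^7 items are all valid IDs inside them; both are assumptions, not confirmed in the chat.

```python
# ID ranges quoted in the chat (assumed to be inclusive bounds).
RANGES = [
    (300000299348, 300001258136),
    (500001000011, 500999999998),
    (501000000016, 501039850984),
]
# Order-of-magnitude estimate given in the chat, not a measured count.
ESTIMATED_ITEMS = 2 * 10**7


def span_size(lo, hi):
    """Number of candidate IDs in an inclusive range."""
    return hi - lo + 1


total_candidates = sum(span_size(lo, hi) for lo, hi in RANGES)
fill_ratio = ESTIMATED_ITEMS / total_candidates

for lo, hi in RANGES:
    print(f"{lo}-{hi}: {span_size(lo, hi):,} candidate IDs")
print(f"total: {total_candidates:,}; estimated fill: {fill_ratio:.1%}")
```

Under those assumptions only about 2% of the candidate IDs would exist, which is why locating the holes matters more than enumerating every number.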
00:08:31<Pedrosso>My question is if there are any tools that ArchiveTeam could/would use for this large list of small files (approx 20kB)
00:09:08<thuban>yes, there are--both archivebot and the urls project (https://wiki.archiveteam.org/index.php/URLs#How_to_help_if_you_have_lists_of_URLs) could do this. i'm just thinking about which would be more appropriate
00:11:54<Pedrosso>I see, I see
00:11:58<thuban>we usually prefer archivebot when retrieving large numbers of urls from a single site, because it offers better feedback and control of request speed (preventing accidental ddosing)
00:12:43lennier2 joins
00:13:31<Pedrosso>Hm, then what of speed?
00:14:26<Pedrosso>The only remaining factors would be AT's willingness to take the URLs on and the speed at which they're saved
00:16:05lennier2_ quits [Ping timeout: 272 seconds]
00:16:58<thuban>_but_ them all being small images is otherwise best-case, so it might be all right. or we could break them up into multiple lists. JAA, what do you think?
00:20:47<thuban>2e7 * 32K is about 600G, which afaik shouldn't be a problem, and speed is probably a question of what ea can/will tolerate and how long we're willing to take
00:21:35<thuban>...whoops, first response should have been:
00:21:39<thuban>well no, the other factor is that archivebot starts having issues if given very large lists, and 2e7 is on the high side.
00:28:42<Pedrosso>But you are saying that speed from ArchiveTeam's side is not a problem?
00:34:40<thuban>from what i recall of previous discussion, i suspect the spore servers would be the limiting factor, yeah.
00:35:12<Pedrosso>I see, alright. We're still busy getting all the viable links though
00:38:38<thuban>sounds good. by the way, did you get around to extracting the imgur links from the sporepedia2.foroactivo.com crawl like you mentioned?
00:40:31<flashfire42|m>I’ll look for some spore telegram groups too. I’ve just done a bunch more coupon ones to complement the crypto stuff. And yes I do throw in good stuff too sometimes
00:44:07superkuh_ is now known as superkuh
00:52:10<@JAA>Pedrosso, thuban: 20M is feasible with AB, especially with images. They take little processing time, so they can run fast. It'd probably take on the order of 2 weeks.
00:53:30<@JAA>As was mentioned, #// doesn't work well for lists of single/few hosts due to the DDoS risk, and we get no real feedback over what happened to the URLs once they go in. It's very much a best-effort shotgun approach at the internet, not useful for targeted crawls.
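A back-of-the-envelope check of the figures in this exchange: thuban's "2e7 * 32K is about 600G" and JAA's "on the order of 2 weeks". It assumes "32K" means 32 KiB per item and that the crawl runs at a constant rate for exactly 14 days; neither is stated in the chat.

```python
ITEMS = 2 * 10**7        # estimated number of URLs (from the chat)
SIZE_BYTES = 32 * 1024   # assumed: "32K" read as 32 KiB per item
DAYS = 14                # assumed: "2 weeks" taken literally

total_bytes = ITEMS * SIZE_BYTES
total_gib = total_bytes / 1024**3

# Average request rate needed to finish in the assumed time window.
rate = ITEMS / (DAYS * 86400)
# Corresponding average download bandwidth.
bandwidth_mib_s = rate * SIZE_BYTES / 1024**2

print(f"total size: {total_gib:.0f} GiB")
print(f"average rate: {rate:.1f} requests/s")
print(f"average bandwidth: {bandwidth_mib_s:.2f} MiB/s")
```

The totals come out in the same ballpark as the numbers quoted above, and the implied request rate is modest, which fits the remark that the Spore servers rather than ArchiveBot would be the limiting factor.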
00:59:18simon8162 quits [Quit: ZNC 1.8.2 - https://znc.in]
00:59:37simon816 (simon816) joins
01:04:05IDK quits [Client Quit]
01:12:58<Pedrosso>thuban: I got stumped at trying to get to the .warc's. Someone offered to do it and if they won't/can't I'll take it back up
01:13:23<Pedrosso>okay that was really unclear. I got stumped at trying to use them once downloaded
01:15:36<Pedrosso>It also appears we are way closer to finishing the list than I thought we were; it appears as though we're halfway but I'll have to confirm that with them
01:15:52<thuban>Pedrosso: no worries, i've just done it
01:16:03<Pedrosso>oh, sweet
01:16:05<Pedrosso>thank you
01:16:59<thuban>(for future reference, you can handle .warc.gz with anything that handles .gz--zless, zgrep, etc)
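The same idea as zgrep, sketched in Python for anyone who prefers a script: since a .warc.gz is ordinary gzip data, the standard gzip module can stream it line by line. The function name, the regular expression, and the file name in the usage note are illustrative assumptions, not something used in the chat.

```python
import gzip
import re

# Illustrative pattern for imgur links; an assumption, adjust as needed.
IMGUR_RE = re.compile(rb"https?://(?:[a-z]+\.)?imgur\.com/[A-Za-z0-9_./-]+")


def find_links(path, pattern=IMGUR_RE):
    """Return the sorted, de-duplicated matches of `pattern` in a gzip file.

    gzip.open reads concatenated gzip members transparently, so this also
    works when each record was compressed separately.
    """
    found = set()
    with gzip.open(path, "rb") as fh:
        for line in fh:
            for match in pattern.finditer(line):
                found.add(match.group(0).decode("ascii"))
    return sorted(found)
```

Usage would look like `find_links("crawl.warc.gz")`, where the file name is hypothetical. This scans raw bytes rather than parsing WARC records, so it is only a text search, the same as zgrep.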
01:18:17<Pedrosso>Nice to say, but honestly, it's just really hard for me to wrap my head around anything that doesn't have an obvious "Click here to download exactly what you want" button and a GUI, haha.
01:18:40<Pedrosso>I still probably will want to look through .warc.gz:s in the future so, thanks
01:20:45<Pedrosso>Thanks again :)
01:23:13ThreeHM quits [Ping timeout: 272 seconds]
01:23:58ThreeHM (ThreeHeadedMonkey) joins
01:34:21<thuban>Pedrosso: you're welcome!
01:36:39<thuban>JAA: thanks!
01:38:47<thuban>also, idk if you saw the earlier discussion, but is it correct that we don't move third-party uploads to the archiveteam collection anymore and https://wiki.archiveteam.org/index.php/Frequently_Asked_Questions#halp_pls_halp should be updated?
01:41:24<Pedrosso>Yeah, that was quite confusing
01:48:47<TheTechRobo>I did an update to at least remove the part where it claims that
01:49:10<TheTechRobo>It probably deserves some rewording, though
01:49:21<h2ibot>TheTechRobo edited Frequently Asked Questions (-51, Temporary update to reduce confusion: AT…): https://wiki.archiveteam.org/?diff=51149&oldid=50785
01:50:14<thuban>yeah. i also noted when i went to link it just now that the entries aren't proper headings, just bold... there's a bunch of stuff to fix there
01:53:04ThreeHM quits [Ping timeout: 265 seconds]
01:53:28ThreeHM (ThreeHeadedMonkey) joins
01:53:36<@JAA>Yeah, the vast majority of that page is from almost a decade ago.
02:02:28HP_Archivist (HP_Archivist) joins
02:22:08Wohlstand quits [Client Quit]
02:31:38AlsoHP_Archivist joins
02:31:45HP_Archivist quits [Remote host closed the connection]
02:34:30AlsoHP_Archivist quits [Client Quit]
02:34:55simon816 quits [Client Quit]
02:35:09simon816 (simon816) joins
02:45:41<Pedrosso>Found a site with a lot of Spanish forums https://www.google.com/search?q=site%3Aforoactivo.com
02:49:46dumbgoy__ joins
02:50:03kiryu_ joins
02:53:47dumbgoy_ quits [Ping timeout: 272 seconds]
02:53:47kiryu__ quits [Ping timeout: 272 seconds]
03:14:29kiryu__ joins
03:14:29kiryu_ quits [Read error: Connection reset by peer]
03:20:55kiryu__ quits [Client Quit]
03:21:20kiryu joins
03:21:20kiryu quits [Changing host]
03:21:20kiryu (kiryu) joins
03:35:12katocala quits [Remote host closed the connection]
04:36:33DogsRNice_ quits [Read error: Connection reset by peer]
04:43:12dumbgoy__ quits [Ping timeout: 265 seconds]
04:46:19okay joins
04:46:30okay quits [Remote host closed the connection]
04:47:21BlueMaxima_ quits [Read error: Connection reset by peer]
04:47:49katocala joins
05:11:56Ruthalas59 quits [Quit: Ping timeout (120 seconds)]
05:12:13Ruthalas59 (Ruthalas) joins
05:14:11<h2ibot>Scarlett03 edited Deathwatch (+199, wilko aqquired by CDS Superstores): https://wiki.archiveteam.org/?diff=51150&oldid=51147
05:22:37etnguyen03 quits [Ping timeout: 272 seconds]
05:28:29simon8162 (simon816) joins
05:31:07Elizabeth (Elizabeth) joins
05:31:08colona_ joins
05:31:09Pedrosso quits [Client Quit]
05:31:09TheTechRobo quits [Client Quit]
05:31:09JensRex quits [Remote host closed the connection]
05:31:09h3ndr1k quits [Remote host closed the connection]
05:31:09wyatt8740 quits [Client Quit]
05:31:09colona quits [Remote host closed the connection]
05:31:09aismallard quits [Client Quit]
05:31:09simon816 quits [Client Quit]
05:31:09rewby1 (rewby) joins
05:31:09@ChanServ sets mode: +o rewby1
05:31:12@rewby quits [Remote host closed the connection]
05:31:12Eliz quits [Remote host closed the connection]
05:31:12Kitty quits [Remote host closed the connection]
05:31:12ThreeHM quits [Write error: Broken pipe]
05:31:12Hackerpcs quits [Remote host closed the connection]
05:31:13aismallard joins
05:31:16Pedrosso joins
05:31:17@rewby1 is now known as @rewby
05:31:18JensRex (JensRex) joins
05:31:22etnguyen03 (etnguyen03) joins
05:31:30ThreeHM (ThreeHeadedMonkey) joins
05:31:43TheTechRobo (TheTechRobo) joins
05:31:46Hackerpcs (Hackerpcs) joins
05:31:58Kitty (Kitty) joins
05:32:51h3ndr1k (h3ndr1k) joins
05:40:31wyatt8740 joins
05:48:17etnguyen03 quits [Client Quit]
05:59:27<Ryz>Yo, regarding https://wiki.archiveteam.org/index.php/Deathwatch - if there's a bunch of entries in https://wiki.archiveteam.org/index.php/Deathwatch#Pining_for_the_Fjords_(Dying) - shouldn't the ones that died already move to https://wiki.archiveteam.org/index.php/Deathwatch#Dead_as_a_Doornail ?
06:02:06<thuban>yes, it just sometimes takes a while for someone to get around to it (esp when sites don't actually shut down on the announced schedule)
06:03:08<@JAA>Yeah, I moved a bunch the other day, several of which had been dead for months.
06:03:23<h2ibot>Ryz edited Deathwatch (+243, /* 2024 */ Add GameBattles): https://wiki.archiveteam.org/?diff=51151&oldid=51150
06:11:00<Ryz>Hmm...I'm not sure if ArchiveBot can handle archiving stuff like https://gamebattles.majorleaguegaming.com/tournaments :/
06:17:09ScenarioPlanet quits [Client Quit]
06:52:37Island_ quits [Read error: Connection reset by peer]
07:03:02benjinsmi quits [Read error: Connection reset by peer]
07:10:33<Flashfire42>israel just stormed Gaza hospital
07:29:23<Ryz>Ah yes, killing the website version of Comixology, and then finally killing off the app version so it can merge into Kindle, wow Amazon :/ - https://www.comicsbeat.com/comixology-app-merges-with-the-kindle-app-at-amazon/
07:58:10Arcorann (Arcorann) joins
08:10:42decky_e_ quits [Read error: Connection reset by peer]
08:12:21Pedrosso quits [Client Quit]
08:12:21TheTechRobo quits [Client Quit]
08:12:25Pedrosso joins
08:12:48TheTechRobo (TheTechRobo) joins
08:15:01TheTechRobo quits [Excess Flood]
08:16:37TheTechRobo (TheTechRobo) joins
08:38:01<h2ibot>JustAnotherArchivist edited Deathwatch (+52, /* 2023 */ Add LARM.fm): https://wiki.archiveteam.org/?diff=51152&oldid=51151
09:08:51decky_e joins
09:55:04DigitalDragons quits [Quit: Leaving]
10:01:25Bleo1 joins
10:18:24DigitalDragons (DigitalDragons) joins
10:30:44Peroniko (Peroniko) joins
10:52:15Bleo18 joins
10:52:15thehedgeh0g quits [Ping timeout: 265 seconds]
10:52:15wyatt8740 quits [Client Quit]
10:52:15Bleo1 quits [Client Quit]
10:52:15Bleo18 is now known as Bleo1
10:52:15DigitalDragons quits [Client Quit]
10:52:18thehedgeh0g (mrHedgehog0) joins
10:53:45qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
10:53:50kiryu_ joins
10:54:17wyatt8740 joins
10:56:27qwertyasdfuiopghjkl quits [Excess Flood]
10:56:27Bleo1 quits [Client Quit]
10:56:27TheTechRobo quits [Client Quit]
10:56:27Pedrosso quits [Client Quit]
10:56:27Peroniko quits [Remote host closed the connection]
10:56:30Pedrosso joins
10:56:35Bleo1 joins
10:56:39Peroniko (Peroniko) joins
10:57:30TheTechRobo (TheTechRobo) joins
10:57:30TheTechRobo quits [Excess Flood]
10:57:30Peroniko quits [Remote host closed the connection]
10:57:51Peroniko (Peroniko) joins
10:58:07TheTechRobo (TheTechRobo) joins
10:58:48qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
10:58:48kiryu quits [Ping timeout: 263 seconds]
10:58:59TheTechRobo quits [Excess Flood]
11:15:00benjins joins
11:16:18benjinsm joins
11:16:26kiryu__ joins
11:16:56lennier2_ joins
11:18:26qwertyasdfuiopghjkl quits [Remote host closed the connection]
11:19:03kiryu_ quits [Ping timeout: 257 seconds]
11:19:32Peroniko quits [Ping timeout: 265 seconds]
11:19:32lennier2 quits [Ping timeout: 265 seconds]
11:20:51Peroniko (Peroniko) joins
11:21:28benjins quits [Ping timeout: 265 seconds]
11:28:14Peroniko quits [Ping timeout: 265 seconds]
12:39:04icedice (icedice) joins
12:48:29Arcorann quits [Ping timeout: 272 seconds]
12:55:13SF quits [Remote host closed the connection]
12:56:59TheTechRobo (TheTechRobo) joins
13:02:30bf_ quits [Remote host closed the connection]
13:12:25Wohlstand (Wohlstand) joins
13:12:56bf_ joins
13:13:21bf_ quits [Remote host closed the connection]
13:13:24bf_ joins
13:54:01qwertyasdfuiopghjkl (qwertyasdfuiopghjkl) joins
13:55:34bf_ quits [Remote host closed the connection]
13:55:49DigitalDragons (DigitalDragons) joins
14:33:16etnguyen03 (etnguyen03) joins
14:45:25Wohlstand quits [Client Quit]
15:01:29dave3 quits [Ping timeout: 272 seconds]
15:01:31dumbgoy__ joins
15:01:52dave3 (dave) joins
15:06:46BearFortress joins
15:10:21BearFortress_ quits [Ping timeout: 272 seconds]
16:05:20Island joins
16:16:11rohvani joins
16:44:05kiryu__ quits [Remote host closed the connection]
16:51:25kiryu joins
16:51:25kiryu quits [Changing host]
16:51:25kiryu (kiryu) joins
17:22:36<tomodachi94>I would appreciate it if someone would grab "haughey.com". This user posted that their blog would be shut down in 60 days: https://xoxo.zone/@mathowie/111415557908672738
17:22:56<tomodachi94>(Can't find any AB jobs: https://archive.fart.website/archivebot/viewer/?q=haughey.com)
17:24:35<tomodachi94>Unsure if they are going to save the blog or not, but better safe than sorry ig?
17:26:39DogsRNice joins
17:40:57atphoenix_ (atphoenix) joins
17:43:37atphoenix__ quits [Ping timeout: 272 seconds]
17:55:30eggdrop leaves
17:56:48eggdrop (eggdrop) joins
18:00:44lunik173 quits [Quit: :x]
18:22:42<@JAA>Weird, that isn't even a Blogger blog as far as I can see. Maybe it was in the past.
18:26:07icedice2 (icedice) joins
18:26:17Bleo18 joins
18:26:18Pedrosso quits [Client Quit]
18:26:18TheTechRobo quits [Client Quit]
18:26:18Bleo1 quits [Client Quit]
18:26:18rohvani quits [Client Quit]
18:26:18icedice quits [Remote host closed the connection]
18:26:18DogsRNice quits [Remote host closed the connection]
18:26:19Bleo18 is now known as Bleo1
18:26:20Pedrosso joins
18:26:22rohvani joins
18:26:23DogsRNice joins
18:26:42TheTechRobo (TheTechRobo) joins
18:26:46<@JAA>Running
18:30:48<tomodachi94>Appreciated
18:31:02<fireonlive>JAA++
18:31:02<eggdrop>[karma] 'JAA' now has 3 karma!
18:31:09<@JAA>Also #frogger for the upcoming Blogger project
18:34:24taewoogf joins
18:35:33etnguyen03 quits [Ping timeout: 272 seconds]
18:36:21taewoogf quits [Remote host closed the connection]
18:44:12etnguyen03 (etnguyen03) joins
19:12:55etnguyen03 quits [Ping timeout: 272 seconds]
19:29:16<Ryz>arkiver and others, a reminder on not only Blogger stuff, it's also Google Docs and other goodies like Google Photos; rather curious it's not YouTube, though probably a specific reason >;o
19:37:54<vokunal|m>Frogger might want to keep track of urls to those other services in the blogs and potentially send them to another project as well. Is google drive in the burnpile? might be a good idea to get #googlecrash back online if so
19:40:42<@JAA>As I understand it, everything associated with the inactive accounts is getting shredded.
19:42:51CraftByte (DragonSec|CraftByte) joins
19:46:26vokunal|m uploaded an image: (903KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.org/aNKbhbTkcFxWHGUKjbcuwLIT/image.png >
19:46:27<vokunal|m>It's so fun watching these things work. Probably a bit inefficient having them display every single line, but it's nice to watch
19:56:15lunik173 joins
20:06:51lunik173 quits [Ping timeout: 265 seconds]
20:28:55le0n_ quits [Ping timeout: 272 seconds]
20:32:31le0n (le0n) joins
20:36:57etnguyen03 (etnguyen03) joins
20:52:02nicolas17 joins
21:15:54<@arkiver>yeah Ryz
21:28:07BlueMaxima joins
21:28:45dumbgoy joins
21:32:15dumbgoy__ quits [Ping timeout: 272 seconds]
21:40:33dumbgoy_ joins
21:43:31dumbgoy quits [Ping timeout: 265 seconds]
21:45:24Wohlstand (Wohlstand) joins
21:54:48dumbgoy__ joins
21:57:32dumbgoy_ quits [Ping timeout: 265 seconds]
22:00:59<Pedrosso>I've gathered a fairly extensive but not complete list of old and dying or thriving but niche Spore-related forums https://transfer.archivete.am/J2GVQ/sporeforums1.txt
22:29:06Peroniko (Peroniko) joins
22:29:07Peroniko quits [Max SendQ exceeded]
22:29:28Peroniko (Peroniko) joins
22:29:29Peroniko quits [Max SendQ exceeded]
22:29:55Peroniko (Peroniko) joins
22:31:50Peroniko quits [Client Quit]
22:32:07Peroniko (Peroniko) joins
22:32:08Peroniko quits [Max SendQ exceeded]
22:32:34Peroniko (Peroniko) joins
22:51:23<Peroniko>Copied from #archiveteam-ot: I want to archive a few hundred historical documents from the local library (books, newspapers...). The problem is that they can't be archived using the Wayback Machine because each image is loaded using JavaScript and the links to those images aren't loaded in a way that IA can capture them. The names of the images are available in the source code of each book (for example: https://www.old.dlib.me/sken_prikaz_1_f.php?id_jedinice=1034)
22:51:23<Peroniko>and will show that the images are loaded from the lista_skenova section and that they exist in the skenovi/nj-gorski-vijenac-engleski folder under the base URL. The folder name is different for each document. Has anyone else encountered this type of library preview? I think I've seen it before. I would also like to convert all of those books to PDF and upload them to IA separately. I've begun downloading this manually using some
22:51:25<Peroniko>basic scripts and wget, but there are about 1500 pages of this and it would be too labor-intensive to continue like that.
23:03:27Matthww119 quits [Ping timeout: 272 seconds]
23:05:59<thuban>Peroniko: interesting, i will take a look at this and get back to you in a bit. are the available documents just the ones under the https://www.old.dlib.me/petarpetrovic2njegos/ collection, or is there more?
23:06:30<Peroniko>There are others here https://www.old.dlib.me/
23:06:51<Peroniko>books, manuscripts, photos, maps..
23:09:07DogsRNice_ joins
23:09:31<thuban>oh, my mistake! i saw those but didn't see that they were browsable
23:09:31<thuban>(the link isn't clearly indicated and the "english" site mostly isn't...)
23:09:31rohvani quits [Client Quit]
23:09:31DogsRNice quits [Remote host closed the connection]
23:09:31Peroniko quits [Remote host closed the connection]
23:09:31TheTechRobo quits [Client Quit]
23:09:31Pedrosso quits [Client Quit]
23:09:31CraftByte quits [Client Quit]
23:09:33rohvani joins
23:09:35Pedrosso joins
23:09:36CraftByte (DragonSec|CraftByte) joins
23:09:40Peroniko (Peroniko) joins
23:09:49lunik173 joins
23:09:57TheTechRobo (TheTechRobo) joins
23:13:23<thuban>i think it should be possible to get the documents to work in the wayback machine
23:20:33etnguyen03 quits [Ping timeout: 272 seconds]
23:25:48<Peroniko>I've made this script to download. Seems to work but not yet fully tested. https://gist.github.com/Fooftilly/52793337319782576ad57fc01cbbb312
23:28:49etnguyen03 (etnguyen03) joins
23:31:09<thuban>bad ids don't result in 404s, unfortunately
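Because the HTTP status can't be trusted here, one option is to judge a page by its content. The sketch below treats a page as a real document only if its source references at least one file under skenovi/. That the full paths appear literally in the page source is an assumption drawn from Peroniko's description, not a verified detail of the site, and both function names are hypothetical.

```python
import re

# Assumed: scan paths appear in the page source as "skenovi/<folder>/<file>".
SCAN_PATH_RE = re.compile(r"skenovi/[^\s\"'<>]+")


def extract_scan_paths(page_source):
    """Return every skenovi/ path referenced in the page source."""
    return SCAN_PATH_RE.findall(page_source)


def is_real_document(page_source):
    """Treat a page as valid only if it references at least one scan."""
    return len(extract_scan_paths(page_source)) > 0
```

A crawler enumerating id_jedinice values could then skip any response for which is_real_document returns False, instead of relying on a 404.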