| 00:00:20 | | lennier1 (lennier1) joins |
| 00:06:28 | | railen69 quits [Remote host closed the connection] |
| 00:09:35 | | Aoede quits [Ping timeout: 252 seconds] |
| 00:09:46 | | Aoede_ (Aoede) joins |
| 00:12:26 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 00:12:35 | | BlueMaxima joins |
| 00:15:19 | | Dallas (Dallas) joins |
| 00:21:35 | | railen63 joins |
| 00:56:49 | | Hajdar quits [Remote host closed the connection] |
| 01:00:46 | | Hajdar (Hajdar) joins |
| 01:13:37 | <@arkiver> | anyone have a channel name idea for "Trove"? see Deathwatch |
| 01:14:30 | <flashfire42> | https://cohost.org/lilrawk/post/1704328-blogger-they-re-kil |
| 01:15:38 | <flashfire42> | TrovePlundering arkiver |
| 01:15:46 | <flashfire42> | I dunno something along the lines of treasure trove |
| 01:16:21 | <flashfire42> | TreasureTheTrashedTrove |
| 01:16:49 | <flashfire42> | TreasureTheTrashedTrove could work. Its getting trashed and we will treasure what we grab |
| 01:17:06 | <flashfire42> | #treasurethetrashedtrove |
| 01:20:10 | <pokechu22> | https://viaf.org/viaf/partnerpages/NLA.html might be useful for that - there's VIAF data dumps at https://viaf.org/viaf/data/. Though I'm guessing there's more to trove than just their authority control records; that's just the only thing I've used it for |
| 01:21:05 | <fireonlive> | link to deathwatch: https://wiki.archiveteam.org/index.php/Deathwatch#2023:~:text=July%3A%20Trove |
| 01:21:07 | <@arkiver> | pages are sequential |
| 01:21:14 | <@arkiver> | everything can be easily downloaded |
| 01:21:24 | <@arkiver> | or well, discovered |
| 01:21:27 | <@arkiver> | and then downloaded |
| 01:21:43 | <@arkiver> | (so i meant easily discovered - downloading required a more complicated Warrior project) |
| 01:22:36 | <@arkiver> | trove is the newspapers/gazettes right? or is there more to it? |
| 01:23:00 | <@arkiver> | all articles also have sequential identifiers |
| 01:26:00 | <pokechu22> | This is the authority control part: https://trove.nla.gov.au/people/1500864 - wikidata has 3 properties: https://www.wikidata.org/wiki/Property:P1315 https://www.wikidata.org/wiki/Property:P5603 https://www.wikidata.org/wiki/Property:P10044 |
| 01:30:02 | <pokechu22> | https://trove.nla.gov.au/help/categories says there's a bunch of other stuff |
| 01:30:05 | <vokunal|m> | Name sounds good. Funny timing. My Dad just brought this home. https://imgur.com/a/63feuCv |
| 01:31:24 | <pokechu22> | an API: https://trove.nla.gov.au/about/create-something/using-api |
| 01:46:22 | | systwi_ joins |
| 02:01:28 | <h2ibot> | Yts98 edited LINE BLOG (+2, Launch the project): https://wiki.archiveteam.org/?diff=49955&oldid=49952 |
| 02:04:28 | <h2ibot> | Yts98 edited Current Projects (+0, Launch LINE BLOG): https://wiki.archiveteam.org/?diff=49956&oldid=49946 |
| 02:17:18 | <flashfire42> | if reddit is currently paused it shouldn't be the warrior's default choice for ArchiveTeam's Choice, should it? |
| 02:21:30 | <fireonlive> | hm i guess not |
| 02:24:35 | | Dallas quits [Client Quit] |
| 02:30:48 | | Sobex joins |
| 02:31:08 | | systwi__ (systwi) joins |
| 02:31:13 | <Sobex> | can we still retrieve photos from friendster? |
| 02:32:38 | | Sobex quits [Remote host closed the connection] |
| 02:32:58 | | systwi quits [Ping timeout: 265 seconds] |
| 02:34:47 | | beario quits [Ping timeout: 252 seconds] |
| 02:46:21 | | rktk (rktk) joins |
| 02:46:27 | <rktk> | I just saw this today and am mildly concerned |
| 02:46:29 | <rktk> | https://cohost.org/lilrawk/post/1704328-blogger-they-re-kil |
| 02:46:35 | <rktk> | "Google is killing off Album Archive which seems to potentially include Blogger webpages" |
| 02:46:39 | <rktk> | " Their source is "getting their takeout file from this and it being all the contents of their website," which I think is a pretty fair first party source." |
| 02:46:55 | <rktk> | just a heads up... hell the old internet is being wiped pretty fast it seems |
| 02:48:01 | <flashfire42> | https://transfer.archivete.am/ZwBIV/Starting%20CheckIP%20for%20Item.txt Got this on DPReview warrior project if someone can fix it or tell me stop being an idiot and running that project |
| 03:11:00 | | systwi__ is now known as systwi |
| 03:11:01 | | yts98 leaves |
| 03:11:09 | | yts98 joins |
| 03:23:03 | | dan- joins |
| 03:29:20 | | dan- is now authenticated as dan- |
| 03:33:14 | | Hajdar quits [Client Quit] |
| 03:36:50 | | BigBrain quits [Remote host closed the connection] |
| 03:37:07 | | BigBrain (bigbrain) joins |
| 04:30:33 | | Arcorann (Arcorann) joins |
| 04:45:53 | <kiska> | arkiver: Trove also has WARCs, although downloading might be restricted (à la WBM). It's pretty much anything that a librarian finds helpful to its patrons. So books would be included here, photos, newspapers, archived web pages (Australian WBM), magazines, newsletters, maps, music, video, people/organisations and pretty much anything in |
| 04:45:53 | <kiska> | between |
| 04:55:09 | | Pingerfowder quits [Quit: ZNC - https://znc.in] |
| 04:55:20 | | Pingerfowder (Pingerfowder) joins |
| 05:00:01 | | Pingerfowder quits [Client Quit] |
| 05:00:53 | | Pingerfowder (Pingerfowder) joins |
| 05:04:44 | | Pingerfowder quits [Client Quit] |
| 05:07:12 | | justmolamola joins |
| 05:07:19 | | justmolamola_ joins |
| 05:08:54 | | justmolamola_ quits [Client Quit] |
| 05:24:31 | | hitgrr8 joins |
| 05:27:48 | | chrismeller quits [Quit: Ping timeout (120 seconds)] |
| 05:29:11 | | chrismeller (chrismeller) joins |
| 05:43:21 | | chrismeller5 (chrismeller) joins |
| 05:43:26 | | chrismeller quits [Ping timeout: 252 seconds] |
| 05:43:26 | | chrismeller5 is now known as chrismeller |
| 05:44:22 | | dumbgoy_ quits [Ping timeout: 265 seconds] |
| 06:33:52 | | BlueMaxima quits [Read error: Connection reset by peer] |
| 06:49:06 | | chrismeller0 (chrismeller) joins |
| 06:51:33 | | chrismeller quits [Ping timeout: 265 seconds] |
| 06:51:33 | | chrismeller0 is now known as chrismeller |
| 07:05:44 | | yzqzss joins |
| 07:05:53 | | yzqzss quits [Remote host closed the connection] |
| 07:13:41 | <yzqzss|m> | yts98: STWP (a Chinese group of web archivists) is archiving Banciyuan. https://t.me/saveweb/150 |
| 07:22:30 | <yzqzss|m> | We (stwp) have scraped 11M users, 42M items, 0.3M groups and 2.4M circles metadata from its API |
| 07:23:37 | <flashfire42> | yzqzss|m in what format? WARC? TXT? Raw files? |
| 07:29:02 | | justmolamola quits [Ping timeout: 252 seconds] |
| 07:32:28 | <BigBrain> | channel name for stackoverflow decided yet? |
| 07:59:41 | <yzqzss|m> | <flashfire42> "yzqzss in what format? WARC? TXT..." <- We can't use WARC to store this website's content because there is too much of it, and web archiving is slow for it, since much of the content requires JS to load. |
| 07:59:41 | <yzqzss|m> | So, we used MongoDB to store the metadata from their webapi/appapi. We have already (still in progress...) collected over 500GB of metadata, including 11M users, 42M items, 0.4M groups, and 2.4M circles. |
| 08:02:20 | <yzqzss|m> | The most challenging aspect would be how to download the media resource URLs retrieved from the metadata. Yesterday, we used a simple grep to extract 129GB of urls.txt (unfiltered and with duplicate entries) from our current 500GB metadata. We might seek help from here when our urls.txt is ready to download. :) |
| 08:03:14 | <flashfire42> | Urls.txt could be queued up with the warrior maybe |
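A rough sketch of how an unfiltered urls.txt with duplicate entries, like the one described above, could be reduced to unique URLs before queueing. This is an illustration only, not STWP's actual tooling: the function name `unique_urls` is hypothetical, and for a file in the 129GB range the in-memory set of digests may itself be too large, in which case an external sort would be the more practical route.

```python
import hashlib

def unique_urls(lines):
    """Yield each non-empty URL once, in first-seen order.

    Only a fixed-size SHA-1 digest per URL is kept in memory,
    not the URL string itself (hypothetical helper, not STWP's code).
    """
    seen = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        digest = hashlib.sha1(url.encode("utf-8")).digest()
        if digest in seen:
            continue  # duplicate entry
        seen.add(digest)
        yield url
```

Because it is a generator, it can be fed an open file object line by line and written straight back out without loading the whole list.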
| 08:21:20 | | tipi joins |
| 08:21:29 | | tipi leaves |
| 08:26:54 | | Island quits [Read error: Connection reset by peer] |
| 08:33:35 | <fireonlive> | BigBrain: not yet |
| 08:37:35 | | zhongfu_ (zhongfu) joins |
| 08:37:59 | | zhongfu quits [Read error: Connection reset by peer] |
| 08:42:19 | | justmolamola joins |
| 08:58:33 | | BearFortress quits [Client Quit] |
| 09:15:55 | | lunik173 quits [Quit: Ping timeout (120 seconds)] |
| 09:17:35 | | lunik173 joins |
| 09:17:41 | | @Sanqui quits [Read error: Connection reset by peer] |
| 09:18:09 | | Sanqui joins |
| 09:18:11 | | Sanqui is now authenticated as Sanqui |
| 09:18:11 | | Sanqui quits [Changing host] |
| 09:18:11 | | Sanqui (Sanqui) joins |
| 09:18:11 | | @ChanServ sets mode: +o Sanqui |
| 09:23:55 | | BearFortress joins |
| 09:38:38 | | decky_e quits [Client Quit] |
| 09:39:32 | | Ruthalas5 quits [Ping timeout: 258 seconds] |
| 09:41:18 | | Ruthalas5 (Ruthalas) joins |
| 09:48:28 | | decky_e (decky_e) joins |
| 10:00:01 | | railen63 quits [Remote host closed the connection] |
| 10:00:18 | | railen63 joins |
| 10:01:02 | | Matthww1 quits [Read error: Connection reset by peer] |
| 10:01:20 | | Matthww1 joins |
| 10:07:57 | <flashfire42> | https://transfer.archivete.am/ZwBIV/Starting%20CheckIP%20for%20Item.txt Got this on DPReview warrior project if someone can fix it or tell me stop being an idiot and running that project arkiver any idea? |
| 10:15:01 | | jacksonchen666 quits [Ping timeout: 245 seconds] |
| 10:34:29 | <yts98> | yzqzss|m: thanks for informing. I'll take a look at saveweb/fourdimensions-archive and the TG group |
| 10:39:37 | | jacksonchen666 (jacksonchen666) joins |
| 10:49:02 | | railen69 joins |
| 10:51:59 | | railen63 quits [Ping timeout: 258 seconds] |
| 11:01:02 | <h2ibot> | JAABot edited CurrentWarriorProject (+4): https://wiki.archiveteam.org/?diff=49957&oldid=49931 |
| 11:24:53 | <@arkiver> | yts98: can you please post what you have up to now for Banciyuan? |
| 11:25:25 | <@arkiver> | i'd like to get a project started for us - i'd like to check your code similarly as we've done with lineblog |
| 11:25:51 | <flashfire42> | https://transfer.archivete.am/ZwBIV/Starting%20CheckIP%20for%20Item.txt Got this on DPReview warrior project if someone can fix it or tell me stop being an idiot and running that project arkiver any idea? (This is something I posted an hour ago or so I just wanted to know I am aware IRC takes ages sometimes to get a reply) |
| 11:26:37 | <@arkiver> | flashfire42: this is unfortunately expected right now :/ dpreview needs some love which it will get |
| 11:28:08 | <flashfire42> | Duly noted. I will avoid it for the minute and occasionally ask you |
| 11:28:20 | <@arkiver> | thank you! |
| 11:29:55 | <@arkiver> | yzqzss|m: you say 'web archiving is slow for it' - what does that mean? the site is slow? can the site not handle a lot? |
| 11:34:04 | <flashfire42> | is issuu ok to run arkiver ? |
| 11:34:19 | <@arkiver> | flashfire42: no :P |
| 11:34:26 | <@arkiver> | wait let me reorder this |
| 11:35:02 | <flashfire42> | Sorry I am just trying out all the different projects that are still there looking for stuff to run |
| 11:35:50 | <flashfire42> | You are already aware several projects dont work on the warrior anyway |
| 11:37:17 | <@arkiver> | flashfire42: i moved things around! |
| 11:37:20 | <@arkiver> | how does this look? |
| 11:37:29 | <@arkiver> | no need for sorry :) |
| 11:37:56 | <flashfire42> | How do you mean you moved things around? |
| 11:38:16 | <flashfire42> | Ok now I see it |
| 11:38:19 | <flashfire42> | thats better |
| 11:40:23 | | sepro quits [Ping timeout: 252 seconds] |
| 11:53:37 | | decky_e quits [Remote host closed the connection] |
| 11:53:39 | <@arkiver> | let's make a channel for banciyuan, any ideas? |
| 11:54:31 | <flashfire42> | #YuanInTheBank |
| 11:54:51 | <flashfire42> | it kinda sucks but I got nothing else |
| 11:55:38 | <flashfire42> | arkiver if you wanna even consider that one jump in there so I can op you cause I am currently babysitting it |
| 11:55:39 | <@arkiver> | yzqzss|m: if you have a list of URLs then yes please let us know! |
| 11:55:50 | <@arkiver> | we'll likely crawl this as well though, since we need the data in WARCs |
| 11:55:51 | | jacksonchen666 quits [Ping timeout: 245 seconds] |
| 11:56:12 | <h2ibot> | Nemo bis edited Miraheze (+318, /* Backups */ update): https://wiki.archiveteam.org/?diff=49958&oldid=49951 |
| 11:57:12 | <h2ibot> | Nemo bis edited Miraheze (+77, /* Backups */ numbers): https://wiki.archiveteam.org/?diff=49959&oldid=49958 |
| 11:57:33 | <flashfire42> | if #YuanInTheBank is liked can an archiveteam op join it I am babysitting it but going to bed in like half an hour to an hour dont wanna slow down any progress if you like my silly channel name |
| 11:59:21 | | jacksonchen666 (jacksonchen666) joins |
| 12:00:13 | <h2ibot> | JAABot edited CurrentWarriorProject (+1): https://wiki.archiveteam.org/?diff=49960&oldid=49957 |
| 12:02:43 | <@arkiver> | flashfire42: i'm not sure about that one |
| 12:02:50 | <@arkiver> | rewby: lineblog is up and running |
| 12:02:55 | <@arkiver> | next one is up soon |
| 12:03:27 | <@rewby|backup> | Ack |
| 12:06:24 | <@arkiver> | rewby: can you please create a target for Banciyuan? |
| 12:06:26 | <@arkiver> | this would be |
| 12:06:30 | <@arkiver> | archiveteam_banciyuan_ |
| 12:06:36 | <@arkiver> | banciyuan_ |
| 12:06:57 | <@arkiver> | Archive Team 半次元 (Banciyuan): |
| 12:07:03 | <@rewby|backup> | Is there a channel i should watch? |
| 12:07:03 | <@arkiver> | the tracker is up |
| 12:07:06 | <@arkiver> | not yet |
| 12:07:15 | <@arkiver> | i'll ping you when we have a channel |
| 12:22:43 | <@rewby> | arkiver: Target up |
| 12:26:09 | <@arkiver> | rewby: thank you! |
| 12:28:14 | | railen63 joins |
| 12:30:30 | | railen69 quits [Ping timeout: 258 seconds] |
| 12:32:36 | | c joins |
| 12:39:20 | <h2ibot> | Nemo bis edited Miraheze (+237, update): https://wiki.archiveteam.org/?diff=49961&oldid=49959 |
| 12:46:02 | | railen69 joins |
| 12:48:44 | | railen63 quits [Ping timeout: 265 seconds] |
| 12:50:40 | | c quits [Ping timeout: 265 seconds] |
| 12:54:53 | <yzqzss|m> | <arkiver> "yzqzss: if you have a list of..." <- Yes we have. They are stored in the database and take some time to export, please wait. |
| 12:55:30 | <yzqzss|m> | But I worry that archiveteam and we (stwp) may be doing duplicate work... |
| 12:56:07 | <@arkiver> | yzqzss|m: we would partially be doing that yes |
| 12:56:15 | <@arkiver> | but do you see any slowness from the site if you archive fast? |
| 12:56:16 | | jacksonchen666 quits [Client Quit] |
| 12:56:40 | <@arkiver> | however, we have to get this data in WARCs with HTML, playable in services like the Wayback Machine. |
| 12:57:52 | <@arkiver> | yzqzss|m: if you have not seen signs that the site cannot handle much, then we'll likely move forward with this. if you do see signs the site cannot handle this, let's see further what we should do here |
| 12:59:49 | <@arkiver> | yzqzss|m: perhaps you have an idea for a channel name? |
| 13:11:11 | <yzqzss|m> | <arkiver> "but do you see any slowness from..." <- We haven't tried web page, but at least its webapi/appapi doesn't have any WAF. Be aware though that they have staff monitoring gateway traffic. |
| 13:11:47 | <@arkiver> | yzqzss|m: sounds good in that case |
| 13:11:55 | <@arkiver> | do you plan to stick around at Archive Team for this? |
| 13:12:04 | | AlbertLarsan68 (AlbertLarsan68) joins |
| 13:13:51 | <@arkiver> | so Banciyuan literally means "Half Dimension" - maybe there is an English or Chinese pun on that that we can use for a channel name |
| 13:15:19 | <yzqzss|m> | <arkiver> "yzqzss: we would partially be..." <- I think you guys want to archive banciyuan via webpage. Is your goal to archive the entire site? |
| 13:16:46 | <@arkiver> | yzqzss|m: yes, including images/videos/etc. i'm hoping it'll not be too big. if it turns out to be very big, we'll see about perhaps dropping videos or lowering quality. we'd likely also include any API stuff we can find |
| 13:18:32 | | c joins |
| 13:25:12 | | AlbertLarsan68 quits [Client Quit] |
| 13:25:19 | | c quits [Remote host closed the connection] |
| 13:26:32 | | Arcorann quits [Ping timeout: 252 seconds] |
| 13:27:02 | | AlbertLarsan68 (AlbertLarsan68) joins |
| 13:31:24 | <yzqzss|m> | https://transfer.archivete.am/zj9Jf/banciyuan_user_ids.list.txt |
| 13:32:22 | <yzqzss|m> | wc -l --> 10601351 |
| 13:33:42 | <yzqzss|m> | bcy.net/u/{user_id} |
| 13:33:42 | <@arkiver> | yzqzss|m: nice! how complete do you think the list is? |
| 13:34:46 | <@arkiver> | interesting, i see some long IDs and some short IDs. could the short IDs be from an earlier phase in which they were created sequentially? |
| 13:35:45 | <yzqzss|m> | arkiver: Yes! |
| 13:37:56 | | sepro (sepro) joins |
| 13:38:11 | <@rewby> | If I know arkiver at all, the next thought is "we'll iterate over all of them" |
| 13:38:46 | <@arkiver> | rewby: yes for the sequential ones, but some (perhaps later IDs?) seem to be not sequential |
| 13:38:57 | <@arkiver> | we'll queue up all sequential ones and of course keep doing discovery |
| 13:41:59 | <yzqzss|m> | The users from the early years have sequential short IDs. Then one day, banciyuan started assigning random int64 long IDs to new users. |
| 13:43:11 | <yzqzss|m> | <arkiver> "yzqzss: nice! how complete do..." <- 95%+, maybe. |
| 13:43:41 | <@arkiver> | very nice |
| 13:45:00 | <AlbertLarsan68> | I have a question about the Warrior tracker's webpage: Is it normal that the top-list (the left table) has a clicky-mouse-pointer, but nothing happens when I click? |
| 13:46:39 | | railen64 joins |
| 13:46:44 | | AmAnd0A quits [Ping timeout: 265 seconds] |
| 13:47:19 | | AmAnd0A joins |
| 13:48:08 | <imer> | highest sequential id seems to be 4.086.616, one after that is 5.287.119 -> 6.513.318 -> 6.704.473 -> 7.079.273 |
| 13:48:22 | <imer> | https://transfer.archivete.am/WJcDX/banciyuan_user_ids.list_sorted.txt if anyone else is curious |
| 13:49:06 | <imer> | highest seems to be 4.503.256.006.921.376 with 3 outliers in the 7.168.705.518.750.483.509 range |
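The split imer describes, a dense low range followed by sparse long IDs, could be handled roughly as sketched below. This is only an illustration under stated assumptions: the cutoff is the value from imer's message, sequential IDs are assumed to start at 1, the `https://` scheme is assumed on top of the `bcy.net/u/{user_id}` pattern quoted earlier, and all function names are hypothetical.

```python
SEQUENTIAL_MAX = 4_086_616  # highest sequential ID reported by imer

def user_url(user_id):
    """Build a profile URL from the bcy.net/u/{user_id} pattern (https assumed)."""
    return f"https://bcy.net/u/{user_id}"

def split_ids(ids, sequential_max=SEQUENTIAL_MAX):
    """Separate IDs in the dense sequential range from the sparse long IDs."""
    sequential = sorted(i for i in ids if i <= sequential_max)
    sparse = sorted(i for i in ids if i > sequential_max)
    return sequential, sparse

def gaps(known, upper):
    """IDs in 1..upper that are absent from the known list.

    These are the candidates a full iteration over the sequential
    range would add on top of the discovered list.
    """
    known = set(known)
    return [i for i in range(1, upper + 1) if i not in known]
```

The sequential range can simply be iterated in full, while the sparse IDs can only come from discovery, which matches the plan stated above of queueing all sequential ones and continuing discovery.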
| 13:49:28 | | railen69 quits [Ping timeout: 258 seconds] |
| 13:54:06 | <@arkiver> | imer: is this data from the list of yzqzss|m ? |
| 13:58:40 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 13:58:59 | | AmAnd0A joins |
| 14:06:24 | <yzqzss|m> | https://transfer.archivete.am/bxYDX/banciyuan_uids_from_ranking.txt |
| 14:06:24 | <yzqzss|m> | https://transfer.archivete.am/jUaAQ/banciyuan_item_ids_from_ranking.txt |
| 14:06:24 | <yzqzss|m> | --- |
| 14:06:24 | <yzqzss|m> | These users and items are listed on the leaderboard. archive them first? |
| 14:06:24 | <yzqzss|m> | https://bcy.net/{illust|coser|novel}/toppost100 |
| 14:11:44 | | railen63 joins |
| 14:13:48 | | railen64 quits [Ping timeout: 265 seconds] |
| 14:19:02 | <imer> | arkiver: yes, just sorted it out of curiosity |
| 14:47:08 | | railen69 joins |
| 14:50:03 | | railen63 quits [Ping timeout: 265 seconds] |
| 14:50:11 | | toss (toss) joins |
| 15:00:24 | <nstrom|m> | How about #bcbye as channel name? /shrug |
| 15:01:06 | <@arkiver> | it sounds pretty nice |
| 15:03:48 | | dumbgoy_ joins |
| 15:06:28 | <BigBrain> | bancigone |
| 15:10:28 | <mattx433> | begoneyuan? |
| 15:11:28 | <yzqzss|m> | We just extracted the picture URLs in all current items. After deduplication, there are about 67054075 pictures. (Not the final result, we are still crawling, and new items are still added to the DB) |
| 15:12:12 | <@arkiver> | yzqzss|m: how large do you think your final dump will be? |
| 15:17:31 | | railen64 joins |
| 15:20:30 | | railen69 quits [Ping timeout: 265 seconds] |
| 15:23:52 | | railen69 joins |
| 15:26:10 | <yzqzss|m> | IDK, but it should be tens~hundreds of TiB. And we haven't counted the number of videos yet. |
| 15:26:27 | | railen64 quits [Ping timeout: 258 seconds] |
| 15:40:16 | <yzqzss|m> | Also, I personally don't recommend archiving videos, downloading them in bulk will draw the attention of employees. And most of the videos are also available on BiliBili (www.bilibili.com) |
| 15:45:48 | | Minkafighter quits [Client Quit] |
| 15:46:04 | | Minkafighter joins |
| 15:53:28 | <yzqzss|m> | * A few years ago, ByteDance (parent company of TikTok) acquired banciyuan. Although the banciyuan team is gradually abandoning the maintenance of the website, ByteDance's traffic audit team is still there. |
| 15:53:28 | <yzqzss|m> | * The audit team banned our crawler account a few days ago. We have tried to contact them, but they are difficult to negotiate with :( |
| 15:54:04 | | railen64 joins |
| 15:55:14 | | sonick quits [Quit: Connection closed for inactivity] |
| 15:55:36 | | dumbgoy__ joins |
| 15:57:14 | | railen69 quits [Ping timeout: 265 seconds] |
| 15:58:15 | | Dango360 (Dango360) joins |
| 15:59:39 | | dumbgoy_ quits [Ping timeout: 265 seconds] |
| 16:04:32 | <yts98> | arkiver: I'm still working on handling image URL rules, and my code will cover webpages, webapis, images, videos, danmakus |
| 16:07:09 | | Aoede_ is now known as Aoede |
| 16:07:34 | | justmolamola quits [Remote host closed the connection] |
| 16:10:23 | | driib quits [Quit: The Lounge - https://thelounge.chat] |
| 16:10:44 | | driib (driib) joins |
| 16:15:33 | <AlbertLarsan68> | Is the team interested in archiving a French Torrent site (type TPB)? |
| 16:15:33 | <AlbertLarsan68> | The French legislation is chasing them, they have to change their domain constantly, to avoid DNS blocking by all major ISPs. |
| 16:16:27 | <yzqzss|m> | yts98: Do you have Telegram? Can you join https://t.me/saveweb_projects/319 to discuss about banciyuan archiving with us? |
| 16:16:45 | | driib quits [Remote host closed the connection] |
| 16:17:04 | | driib (driib) joins |
| 16:17:35 | <yts98> | yzqzss|m: ok, as my native language is Traditional Chinese |
| 16:24:13 | <yzqzss|m> | ποΈποΈ |
| 16:30:07 | | driib quits [Remote host closed the connection] |
| 16:30:33 | | driib (driib) joins |
| 16:31:20 | | toss quits [Ping timeout: 252 seconds] |
| 16:33:57 | <cronfox> | yts98: ohhhhh |
| 16:35:06 | <yts98> | discussing with STWP. I'll update findings here and on wiki |
| 16:43:47 | | driib quits [Remote host closed the connection] |
| 16:44:07 | | driib (driib) joins |
| 16:44:53 | | VerifiedJ quits [Quit: The Lounge - https://thelounge.chat] |
| 16:46:29 | | VerifiedJ (VerifiedJ) joins |
| 16:53:00 | | toss (toss) joins |
| 16:54:41 | | toss_ (toss) joins |
| 16:56:04 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 16:56:14 | | AmAnd0A joins |
| 16:57:20 | | that_lurker quits [Client Quit] |
| 16:57:29 | | that_lurker (that_lurker) joins |
| 16:57:36 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 16:57:44 | | toss quits [Ping timeout: 252 seconds] |
| 16:57:52 | | AmAnd0A joins |
| 16:58:10 | | driib quits [Remote host closed the connection] |
| 16:58:29 | | driib (driib) joins |
| 17:12:46 | | driib quits [Remote host closed the connection] |
| 17:13:06 | | driib (driib) joins |
| 17:13:42 | | toss (toss) joins |
| 17:17:48 | | BigBrain quits [Remote host closed the connection] |
| 17:18:05 | | BigBrain (bigbrain) joins |
| 17:18:05 | | toss_ quits [Ping timeout: 252 seconds] |
| 17:26:00 | | driib quits [Remote host closed the connection] |
| 17:26:21 | | driib (driib) joins |
| 17:27:35 | | @rewby quits [Ping timeout: 258 seconds] |
| 17:34:55 | | rewby (rewby) joins |
| 17:34:55 | | @ChanServ sets mode: +o rewby |
| 17:35:41 | | decky_e (decky_e) joins |
| 17:39:25 | | driib quits [Remote host closed the connection] |
| 17:39:45 | | driib (driib) joins |
| 17:44:36 | | decky_e quits [Read error: Connection reset by peer] |
| 17:44:58 | | decky_e (decky_e) joins |
| 17:50:31 | | driib quits [Remote host closed the connection] |
| 17:50:50 | | driib (driib) joins |
| 17:52:22 | <kiska> | I have a suggestion for channel name wuciyuan? :D |
| 17:52:34 | <kiska> | Or simplified chinese 无次元 |
| 17:52:49 | <kiska> | Which translates to dimensionless |
| 18:00:09 | | toss leaves |
| 18:02:03 | | driib quits [Remote host closed the connection] |
| 18:02:22 | | driib (driib) joins |
| 18:08:44 | | driib quits [Remote host closed the connection] |
| 18:09:03 | | driib (driib) joins |
| 18:13:26 | | eroc1990 quits [Quit: The Lounge - https://thelounge.chat] |
| 18:14:30 | | _Dango360 quits [Quit: Leaving] |
| 18:14:37 | | Dango360 quits [Client Quit] |
| 18:14:46 | | Dango360 (Dango360) joins |
| 18:14:57 | | eroc1990 (eroc1990) joins |
| 18:20:58 | <yts98> | kiska: now there are 二次元 (2), Banciyuan (1/2), saveweb/fourdimensions-archive (4), wuciyuan (null). lol |
| 18:22:28 | <yts98> | but 五次元 (five dimensional) is also pronounced wuciyuan |
| 18:23:24 | | driib quits [Remote host closed the connection] |
| 18:23:44 | | driib (driib) joins |
| 18:24:38 | | decky_e quits [Ping timeout: 252 seconds] |
| 18:25:58 | | decky_e (decky_e) joins |
| 18:27:05 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 18:27:10 | | AmAnd0A joins |
| 18:28:14 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 18:28:55 | | AmAnd0A joins |
| 18:30:26 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 18:30:43 | | AmAnd0A joins |
| 18:34:38 | | driib quits [Read error: Connection reset by peer] |
| 18:34:57 | | driib (driib) joins |
| 18:45:59 | | Megame (Megame) joins |
| 18:46:26 | | tzt quits [Remote host closed the connection] |
| 18:46:48 | | tzt (tzt) joins |
| 18:46:53 | | imer quits [Ping timeout: 265 seconds] |
| 18:48:03 | | railen64 quits [Remote host closed the connection] |
| 18:48:37 | | driib quits [Remote host closed the connection] |
| 18:48:57 | | driib (driib) joins |
| 18:51:01 | | railen63 joins |
| 18:54:44 | | imer (imer) joins |
| 18:57:15 | <kiska> | You could do 四次元 for siciyuan and if you pronounce si in cantonese :D |
| 18:57:50 | <kiska> | Could be funny saying dead dimension :D |
| 18:59:38 | <BigBrain> | do irc channel names only allow ascii? |
| 18:59:55 | <kiska> | We do have #// for urls so perhaps? |
| 19:00:46 | <kiska> | 05:00:23 [479]#无次元 Illegal channel name |
| 19:00:49 | <kiska> | Sadness! |
| 19:00:58 | <BigBrain> | 四次元 worked |
| 19:01:15 | <kiska> | Yeah I tried that one first |
| 19:01:38 | <BigBrain> | maybe utf-8? |
| 19:02:02 | <kiska> | Traditional Chinese dimensionless is 無次元 |
| 19:02:30 | <BigBrain> | works |
| 19:03:15 | <kiska> | Perhaps I should complain about the other one not working :D |
| 19:03:34 | <kiska> | Its either The Lounge not liking it or irc |
| 19:04:07 | <BigBrain> | weechat as well |
| 19:04:12 | <BigBrain> | probably irc |
| 19:04:41 | | driib quits [Remote host closed the connection] |
| 19:05:00 | | driib (driib) joins |
| 19:06:41 | | AlbertLarsan68 quits [Client Quit] |
| 19:16:22 | | tzt quits [Ping timeout: 265 seconds] |
| 19:18:44 | | driib quits [Remote host closed the connection] |
| 19:19:02 | | driib (driib) joins |
| 19:23:12 | | tzt (tzt) joins |
| 19:26:29 | <@arkiver> | yts98: i'd like to start it asap - perhaps you can post online what you have now? it'll likely be greatly refactored anyway, just like with lineblog |
| 19:27:58 | <@arkiver> | yts98: the most useful information here is what is posted on the wiki - i've started a bit on a project now and would like to incorporate whatever you have in your code as well, to get this started as soon as possible. |
| 19:28:06 | | driib quits [Remote host closed the connection] |
| 19:28:25 | | driib (driib) joins |
| 19:29:06 | <@arkiver> | kiska: yts98: BigBrain: let's go with something in latin script |
| 19:30:23 | <kiska> | I'll part the 2 channels I made then |
| 19:31:13 | <@arkiver> | let's go with #wuciyuan |
| 19:34:09 | | driib quits [Remote host closed the connection] |
| 19:34:28 | | decky_e quits [Ping timeout: 258 seconds] |
| 19:34:29 | | driib (driib) joins |
| 19:34:58 | | decky_e (decky_e) joins |
| 19:35:30 | | Twisty joins |
| 19:35:40 | <@arkiver> | yzqzss|m: i mean - is this what you plan to archive, or would this be the total size given unlimited space to archive? |
| 19:38:08 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 19:38:14 | | AmAnd0A joins |
| 19:39:19 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 19:39:20 | <Twisty> | arkiver Regarding Vine, I don't know about coding or how it works, it's a thing that just came to mind since I've been an avid Vine consumer back in 2015 and while there were and still are compilations on YT, I don't think there has ever been an _actual_ archive made of them besides the official Twitter one which more or less shut down in 2019. |
| 19:39:47 | | Island joins |
| 19:39:56 | | AmAnd0A joins |
| 19:40:31 | <BigBrain> | Twisty: anywhere to scrape for ids? or just bruteforce slowly? |
| 19:40:59 | <Twisty> | I'd guess bruteforcing, here is a link for example https://vine.co/v/5eX56hWvxBU |
| 19:42:15 | <Twisty> | https://vine.co/u/923279753992089600 There are also links to profiles that look like this |
| 19:42:47 | | AmAnd0A quits [Read error: Connection reset by peer] |
| 19:43:02 | | AmAnd0A joins |
| 19:44:32 | <h2ibot> | Arkiver uploaded File:Banciyuan-icon.png: https://wiki.archiveteam.org/?title=File%3ABanciyuan-icon.png |
| 19:46:40 | | driib quits [Remote host closed the connection] |
| 19:46:58 | | driib (driib) joins |
| 19:50:22 | | nix78 joins |
| 19:58:32 | | carnage joins |
| 19:59:49 | | driib quits [Remote host closed the connection] |
| 20:00:08 | | driib (driib) joins |
| 20:00:40 | | imer quits [Client Quit] |
| 20:02:42 | | imer (imer) joins |
| 20:08:45 | | andrew quits [Quit: ] |
| 20:11:16 | | driib quits [Remote host closed the connection] |
| 20:11:34 | | driib (driib) joins |
| 20:12:03 | | carnage56 joins |
| 20:12:46 | | carnage56 quits [Remote host closed the connection] |
| 20:13:13 | | BigBrain quits [Remote host closed the connection] |
| 20:13:33 | | BigBrain (bigbrain) joins |
| 20:15:11 | <nicolas17> | Twisty: those vine IDs seem too big to reasonably bruteforce |
| 20:15:36 | <masterX244> | seems like the WARCrippers need to be fired up again... |
| 20:15:42 | <Twisty> | :C |
| 20:15:44 | <masterX244> | if we want to hunt down links |
| 20:15:49 | | carnage quits [Ping timeout: 265 seconds] |
| 20:19:42 | <Twisty> | I guess that makes sense, I just hope it'd be possible to archive them. I don't think Vine is as important as Reddit or Imgur but considering the impact it had and the content created throughout the mid 2010s, it would be worth taking a look at possible archival options. |
| 20:23:17 | | driib quits [Remote host closed the connection] |
| 20:23:38 | | driib (driib) joins |
| 20:24:29 | | Megame quits [Client Quit] |
| 20:28:39 | <masterX244> | too bad that common-crawl is still off-limit for processing after we got amazon to wave a white flag after we got sustained 120GBit/S on grepping data |
| 20:31:12 | <@arkiver> | note to anyone who wants to start running bcy.net - your IP will be archived in the page source |
| 20:36:25 | | TTN joins |
| 20:36:50 | | driib quits [Remote host closed the connection] |
| 20:37:10 | | driib (driib) joins |
| 20:37:57 | <fireonlive> | :o |
| 20:41:39 | | Twisty quits [Remote host closed the connection] |
| 20:41:49 | <kiska> | Thats problematic :D |
| 20:45:55 | <flashfire42> | arkiver well that fucking sucks. Will it be a warrior project? |
| 20:50:40 | | driib quits [Remote host closed the connection] |
| 20:50:58 | | driib (driib) joins |
| 20:58:46 | <h2ibot> | Nemo bis edited Wikimedia Commons (+287, /* Other dumps */ we're not quite planning to…): https://wiki.archiveteam.org/?diff=49963&oldid=48033 |
| 21:00:13 | | andrew (andrew) joins |
| 21:01:48 | | Unholy2361 quits [Quit: The Lounge - https://thelounge.chat] |
| 21:02:13 | | tzt quits [Ping timeout: 265 seconds] |
| 21:02:19 | | Unholy2361 (Unholy2361) joins |
| 21:02:58 | | driib quits [Remote host closed the connection] |
| 21:03:17 | | driib (driib) joins |
| 21:03:24 | | tzt (tzt) joins |
| 21:04:00 | <rktk> | Vine?? |
| 21:04:07 | <rktk> | Are those links even still valid? I can't imagine |
| 21:05:10 | <fireonlive> | vine.co hosts an archive |
| 21:05:28 | <fireonlive> | they used to have 'top vines' etc but have since killed that |
| 21:06:09 | <fireonlive> | searching site:vine.co reveals usernames like https://vine.co/TheGabbieShow/ |
| 21:07:23 | <fireonlive> | direct links seem to be 'hard to get' from there but here's a random ID: https://vine.co/v/iU3uhlJrzBm |
| 21:07:38 | <fireonlive> | they're quite long |
| 21:08:04 | <rktk> | oh wow, Vine.co is still online |
| 21:08:21 | | driib quits [Remote host closed the connection] |
| 21:08:26 | <flashfire42> | Could have it be like the #down-the-tube channel where individual vids can be queued or whole lists like #telegrab |
| 21:08:35 | <rktk> | but Vine is owned by Twitter, and with Elon's shitstorm, it could in theory be dumped eh |
| 21:08:38 | <nicolas17> | yep there's no way we can bruteforce 52036560683837093888 IDs :P |
| 21:08:38 | | driib (driib) joins |
| 21:08:44 | <nicolas17> | rktk: shh |
| 21:08:45 | <rktk> | "Why are we hosting all these old videos nobody is watching anymore" |
| 21:08:50 | <nicolas17> | elon probably doesn't remember vine exists |
| 21:08:53 | <rktk> | ^ |
| 21:09:05 | | tzt quits [Ping timeout: 252 seconds] |
| 21:09:10 | <rktk> | I'll put some old Vine dumps I grabbed for a few channels I liked, this was before they put it in archive state |
| 21:09:11 | <fireonlive> | unless iU3uhlJrzBm is some kinda encoded thing :p |
| 21:09:17 | <rktk> | that looks like a Youtube ID |
| 21:09:29 | <fireonlive> | i think elon said he wanted to bring vine back.. but he's said a lot of things |
| 21:11:23 | <fireonlive> | i randomly found this; unsure if it's anything: https://gist.github.com/bradserbu/4d3a10b54cb8895e6ca7 |
| 21:11:41 | <nicolas17> | wth |
| 21:11:55 | <fireonlive> | might be something else |
| 21:12:07 | <rktk> | Actually it seems my vine archives are a bit useless because they have JSON video URL pointers to the old URLs |
| 21:12:12 | <fireonlive> | oh it says vine.co in the comments |
| 21:12:15 | <rktk> | and it seems most if not all of these videos still exist on vine |
| 21:12:51 | <fireonlive> | vine.co still has some sort of takedown process (documented, anyways) |
| 21:13:25 | <fireonlive> | https://help.twitter.com/en/using-twitter/vine-faqs "Yes. If you wish to delete your Vine account, you can let us know by emailing us at vinehelp@twitter.com with a link to your Vine profile page (i.e., https://vine.co/MyUserName)." |
| 21:13:37 | <fireonlive> | idk what, if any, sort of verification process they'd use there |
| 21:13:55 | <@arkiver> | let's move vine chat to #vinewhine |
| 21:16:26 | <rktk> | Done sorry. |
| 21:20:21 | | driib quits [Remote host closed the connection] |
| 21:20:40 | | driib (driib) joins |
| 21:20:52 | | Ruthalas5 quits [Client Quit] |
| 21:21:11 | | Ruthalas5 (Ruthalas) joins |
| 21:30:02 | | driib quits [Remote host closed the connection] |
| 21:30:22 | | driib (driib) joins |
| 21:30:51 | <h2ibot> | Nemo bis edited Wikimedia Commons (+221, update query): https://wiki.archiveteam.org/?diff=49964&oldid=49963 |
| 21:33:07 | | hitgrr8 quits [Client Quit] |
| 21:35:19 | | driib quits [Remote host closed the connection] |
| 21:35:39 | | driib (driib) joins |
| 21:41:08 | | driib quits [Remote host closed the connection] |
| 21:41:24 | | driib (driib) joins |
| 21:47:18 | | driib quits [Read error: Connection reset by peer] |
| 21:47:38 | | driib (driib) joins |
| 21:53:32 | | driib quits [Remote host closed the connection] |
| 21:53:52 | | driib (driib) joins |
| 22:00:57 | <h2ibot> | JAABot edited CurrentWarriorProject (-1): https://wiki.archiveteam.org/?diff=49965&oldid=49960 |
| 22:04:29 | | Misty joins |
| 22:05:09 | | driib quits [Remote host closed the connection] |
| 22:05:28 | | driib (driib) joins |
| 22:07:03 | <Misty> | @arkiver @nicolas17 can IA store the pictures we are downloading from banciyuan? |
| 22:07:24 | <Misty> | should be over 200TB & about 0.6 billion |
| 22:08:35 | <nicolas17> | I haven't looked into banciyuan at all yet |
| 22:08:50 | <Misty> | oh, i see you in the chat history |
| 22:09:09 | <nicolas17> | yeah but I didn't pay attention to the details of that project :D |
| 22:09:39 | <nicolas17> | the imgur project has archived 590TB so far |
| 22:09:40 | <Misty> | well i'm just asking, we will go on with or without IA, but getting any kind of support (like storage) will always be good |
| 22:10:05 | <Misty> | yeah, but imgur is pretty famous lo |
| 22:10:07 | <Misty> | yeah, but imgur is pretty famous lol |
| 22:10:27 | <nicolas17> | I have no contact with IA to say if that's okay or not, so yeah wait for ark1ver |
| 22:11:07 | | driib quits [Remote host closed the connection] |
| 22:11:25 | | driib (driib) joins |
| 22:11:28 | <Misty> | nicolas17 thanks, do you know other people who have a relationship with IA? |
| 22:20:17 | <yts98> | just uploaded untested code to github.com/yts98/banciyuan-grab . arkiver yzqzss|m can take a look. |
| 22:21:51 | <Misty> | oh @yts98 got you online :) |
| 22:22:17 | <Misty> | so can IA host the pictures downloaded by us? |
| 22:23:00 | | driib quits [Remote host closed the connection] |
| 22:23:19 | | driib (driib) joins |
| 22:24:40 | | decky_e quits [Ping timeout: 258 seconds] |
| 22:25:04 | | nix78 quits [Remote host closed the connection] |
| 22:25:09 | | decky_e (decky_e) joins |
| 22:25:57 | <@arkiver> | Misty: depends on how much |
| 22:26:13 | <@arkiver> | 200 TB? |
| 22:26:20 | <@arkiver> | no, that will likely not happen |
| 22:26:24 | | skyrocket quits [Client Quit] |
| 22:26:50 | <Misty> | arkiverus thanks, usually how large would they accepting? |
| 22:26:58 | <Misty> | arkiver thanks, usually how large would they accepting? |
| 22:27:14 | <@arkiver> | Misty: banciyuan has multiple sizes of images, smaller ones and "original" quality images. which sizes are you talking about? |
| 22:27:59 | <Misty> | i mean total size, and we are backing up "original" images |
| 22:28:00 | <@arkiver> | Misty: yts98: yzqzss|m: let's move to #wuciyuan |
| 22:28:32 | | fishingforsoup joins |
| 22:30:02 | <h2ibot> | Yts98 edited Banciyuan (+2199, Sync findings with Save The Web Project): https://wiki.archiveteam.org/?diff=49966&oldid=49954 |
| 22:35:58 | | skyrocket joins |
| 22:36:27 | | driib quits [Remote host closed the connection] |
| 22:36:28 | | decky_e quits [Ping timeout: 265 seconds] |
| 22:36:45 | | driib (driib) joins |
| 22:36:50 | | decky_e (decky_e) joins |
| 22:37:09 | | skyrocket quits [Client Quit] |
| 22:42:09 | | sonick (sonick) joins |
| 22:48:03 | | decky_e quits [Ping timeout: 258 seconds] |
| 22:49:54 | | decky_e (decky_e) joins |
| 22:50:46 | | driib quits [Remote host closed the connection] |
| 22:51:05 | | driib (driib) joins |
| 22:52:05 | <h2ibot> | Yts98 edited Banciyuan (+242, Update IRC and describe image URLs): https://wiki.archiveteam.org/?diff=49967&oldid=49966 |
| 22:55:47 | | decky_e quits [Ping timeout: 252 seconds] |
| 22:56:27 | | decky_e (decky_e) joins |
| 23:02:25 | | decky_e quits [Remote host closed the connection] |
| 23:04:13 | | driib quits [Remote host closed the connection] |
| 23:04:32 | | driib (driib) joins |
| 23:06:14 | | Ruthalas5 quits [Ping timeout: 252 seconds] |
| 23:08:48 | | Ruthalas5 (Ruthalas) joins |
| 23:17:04 | | Ruthalas5 quits [Ping timeout: 265 seconds] |
| 23:19:26 | | AmAnd0A quits [Ping timeout: 252 seconds] |
| 23:19:34 | | driib quits [Remote host closed the connection] |
| 23:19:43 | | AmAnd0A joins |
| 23:19:53 | | driib (driib) joins |
| 23:21:13 | | Ruthalas5 (Ruthalas) joins |
| 23:22:55 | | Ruthalas5 quits [Client Quit] |
| 23:27:19 | | rocketdive joins |
| 23:28:41 | | AmAnd0A quits [Ping timeout: 258 seconds] |
| 23:28:46 | <nicolas17> | a friend wants to more systematically archive stuff from Apple's "update catalogs", such as the macOS InstallAssistant.pkg |
| 23:30:22 | <nicolas17> | and he asked about deduplication, since when a beta version is published as both "developer beta" and "public beta", you end up with two different URLs with the same 12GB file |
| 23:30:46 | <nicolas17> | so I suggested using WARCs |
| 23:31:17 | <nicolas17> | get deduplication + preserve headers (although they don't matter much in practice) + they work on the WBM |
| 23:32:00 | <nicolas17> | except... only some people are allowed to upload WARCs such that they get indexed by WBM, so that's the biggest advantage killed |
| 23:34:16 | | driib quits [Remote host closed the connection] |
| 23:34:21 | <@arkiver> | nicolas17: it would be awesome to have those |
| 23:34:26 | <@arkiver> | can't we use archivebot for that? |
| 23:34:36 | | driib (driib) joins |
| 23:35:16 | <nicolas17> | would archivebot dedup those? 12GB seems pretty big to store twice... |
| 23:35:29 | <@JAA> | AB does not dedupe anything. |
| 23:35:42 | <@arkiver> | well |
| 23:35:49 | <@arkiver> | question is how often would it get new 12 GBs? |
| 23:35:52 | <@JAA> | This would fit perfectly into my planned software binaries thing, but ETA unknown. |
| 23:35:59 | <@arkiver> | once a month? i wouldn't worry about deduplication |
| 23:36:06 | <@arkiver> | once per hour? maybe deduplication would be nice |
| 23:36:36 | <nicolas17> | https://archive.org/details/sucatalog_012-71768 someone manually uploaded these a while ago (many of which were already deleted from Apple CDN!) |
| 23:36:55 | <nicolas17> | and in that example, the same file is stored 4 times |
| 23:37:07 | <nicolas17> | {public beta, dev beta} x {http, https} |
| 23:37:19 | <flashfire42> | Hash the entries with SHA256 if you get a match on 2 things flag for manual review to start deduplication. Tho that could set a dangerous precedent |
| 23:37:34 | <@arkiver> | you don't have to get them with both http and https, one of the two is fine |
| 23:37:40 | <@arkiver> | the two URLs are important though |
| 23:37:49 | <@arkiver> | nicolas17: how often are these released? |
| 23:38:38 | <nicolas17> | I think there's multiple 'catalog' XML files and some link to the pkg in http and some in https, so that's where that dup came from, wget blindly saved both |
| 23:40:31 | | rocketdive quits [Client Quit] |
| 23:40:53 | | quackifi joins |
| 23:40:57 | | decky_e joins |
| 23:41:43 | <nicolas17> | arkiver: https://theapplewiki.com/wiki/Beta_Firmware/Mac/13.x this should give a rough idea of frequency |
| 23:43:14 | <h2ibot> | LostArchiver edited Project Sonar (+642, Public no longer has access to data.): https://wiki.archiveteam.org/?diff=49968&oldid=46907 |
| 23:44:16 | | dhinakg joins |
| 23:44:58 | <@arkiver> | nicolas17: if it is a major obstacle to not duplicate data, then feel free to duplicate the 12 GB here for both URL |
| 23:45:00 | <@arkiver> | URLs* |
| 23:45:23 | <nicolas17> | dhinakg: https://hackint.logs.kiska.pw/archiveteam-bs/20230618#c353318 |
| 23:45:58 | <dhinakg> | thx |
| 23:46:56 | <nicolas17> | arkiver: well, "throwing the URLs into archivebot" would be the easiest way to get them into WBM, anything else would need messing with WARC-creating tools, doing our own download/upload, and being whitelisted to get said WARCs into a WBM-indexed collection |
| 23:46:58 | <dhinakg> | so i'm not worried about http/https, that's very easy to deduplicate as the links for the "products" (term for each entry in a catalog) are identical and the only difference is http/https |
| 23:47:19 | | quackifi quits [Read error: Connection reset by peer] |
| 23:47:26 | <dhinakg> | but commonly apple will reupload a developer beta onto the public beta catalog, with different links, even though it's the exact same files |
| 23:47:35 | <@arkiver> | nicolas17: agreed, AB seems like the best option |
| 23:47:42 | <@arkiver> | if it handles the 12 GB files well |
| 23:48:03 | <@JAA> | It does, on a suitable pipeline and with concurrency 1. |
| 23:48:32 | | quackifi joins |
| 23:48:37 | <dhinakg> | however the metadata on the public beta catalog would be different |
| 23:48:45 | | driib quits [Remote host closed the connection] |
| 23:49:04 | | driib (driib) joins |
| 23:52:07 | | dhinakg is now authenticated as dhinakg |
| 23:53:05 | | BlueMaxima joins |
| 23:56:57 | | driib quits [Read error: Connection reset by peer] |
| 23:57:15 | | driib (driib) joins |
| 23:58:57 | | dhinakg quits [Remote host closed the connection] |