00:00:20lennier1 (lennier1) joins
00:06:28railen69 quits [Remote host closed the connection]
00:09:35Aoede quits [Ping timeout: 252 seconds]
00:09:46Aoede_ (Aoede) joins
00:12:26BlueMaxima quits [Read error: Connection reset by peer]
00:12:35BlueMaxima joins
00:15:19Dallas (Dallas) joins
00:21:35railen63 joins
00:56:49Hajdar quits [Remote host closed the connection]
01:00:46Hajdar (Hajdar) joins
01:13:37<@arkiver>anyone have a channel name idea for "Trove"? see Deathwatch
01:14:30<flashfire42>https://cohost.org/lilrawk/post/1704328-blogger-they-re-kil
01:15:38<flashfire42>TrovePlundering arkiver
01:15:46<flashfire42>I dunno something along the lines of treasure trove
01:16:21<flashfire42>TreasureTheTrashedTrove
01:16:49<flashfire42>TreasureTheTrashedTrove could work. It's getting trashed and we will treasure what we grab
01:17:06<flashfire42>#treasurethetrashedtrove
01:20:10<pokechu22>https://viaf.org/viaf/partnerpages/NLA.html might be useful for that - there's VIAF data dumps at https://viaf.org/viaf/data/. Though I'm guessing there's more to trove than just their authority control records; that's just the only thing I've used it for
01:21:05<fireonlive>link to deathwatch: https://wiki.archiveteam.org/index.php/Deathwatch#2023:~:text=July%3A%20Trove
01:21:07<@arkiver>pages are sequential
01:21:14<@arkiver>everything can be easily downloaded
01:21:24<@arkiver>or well, discovered
01:21:27<@arkiver>and then downloaded
01:21:43<@arkiver>(so i meant easily discovered - downloading required a more complicated Warrior project)
01:22:36<@arkiver>trove is the newspapers/gazettes right? or is there more to it?
01:23:00<@arkiver>all articles also have sequential identifiers
01:26:00<pokechu22>This is the authority control part: https://trove.nla.gov.au/people/1500864 - wikidata has 3 properties: https://www.wikidata.org/wiki/Property:P1315 https://www.wikidata.org/wiki/Property:P5603 https://www.wikidata.org/wiki/Property:P10044
01:30:02<pokechu22>https://trove.nla.gov.au/help/categories says there's a bunch of other stuff
01:30:05<vokunal|m>Name sounds good. Funny timing. My Dad just brought this home. https://imgur.com/a/63feuCv
01:31:24<pokechu22>an API: https://trove.nla.gov.au/about/create-something/using-api
01:46:22systwi_ joins
02:01:28<h2ibot>Yts98 edited LINE BLOG (+2, Launch the project): https://wiki.archiveteam.org/?diff=49955&oldid=49952
02:04:28<h2ibot>Yts98 edited Current Projects (+0, Launch LINE BLOG): https://wiki.archiveteam.org/?diff=49956&oldid=49946
02:17:18<flashfire42>if reddit is currently paused, it shouldn't be the warrior's default for ArchiveTeam's Choice, should it?
02:21:30<fireonlive>hm i guess not
02:24:35Dallas quits [Client Quit]
02:30:48Sobex joins
02:31:08systwi__ (systwi) joins
02:31:13<Sobex>can we still retrieve photos from friendster?
02:32:38Sobex quits [Remote host closed the connection]
02:32:58systwi quits [Ping timeout: 265 seconds]
02:34:47beario quits [Ping timeout: 252 seconds]
02:46:21rktk (rktk) joins
02:46:27<rktk>I just saw this today and am mildly concerned
02:46:29<rktk>https://cohost.org/lilrawk/post/1704328-blogger-they-re-kil
02:46:35<rktk>"Google is killing off Album Archive which seems to potentially include Blogger webpages"
02:46:39<rktk>" Their source is "getting their takeout file from this and it being all the contents of their website," which I think is a pretty fair first party source."
02:46:55<rktk>just a heads up... hell the old internet is being wiped pretty fast it seems
02:48:01<flashfire42>https://transfer.archivete.am/ZwBIV/Starting%20CheckIP%20for%20Item.txt Got this on the DPReview warrior project, if someone can fix it or tell me to stop being an idiot and running that project
03:11:00systwi__ is now known as systwi
03:11:01yts98 leaves
03:11:09yts98 joins
03:23:03dan- joins
03:33:14Hajdar quits [Client Quit]
03:36:50BigBrain quits [Remote host closed the connection]
03:37:07BigBrain (bigbrain) joins
04:30:33Arcorann (Arcorann) joins
04:45:53<kiska>arkiver: Trove also has WARCs, although downloading might be restricted (a la WBM). It's pretty much anything that a librarian finds helpful to its patrons. So books would be included here, photos, newspapers, archived web pages (Australian WBM), magazines, newsletters, maps, music, video, people/organisations and pretty much anything in
04:45:53<kiska>between
04:55:09Pingerfowder quits [Quit: ZNC - https://znc.in]
04:55:20Pingerfowder (Pingerfowder) joins
05:00:01Pingerfowder quits [Client Quit]
05:00:53Pingerfowder (Pingerfowder) joins
05:04:44Pingerfowder quits [Client Quit]
05:07:12justmolamola joins
05:07:19justmolamola_ joins
05:08:54justmolamola_ quits [Client Quit]
05:24:31hitgrr8 joins
05:27:48chrismeller quits [Quit: Ping timeout (120 seconds)]
05:29:11chrismeller (chrismeller) joins
05:43:21chrismeller5 (chrismeller) joins
05:43:26chrismeller quits [Ping timeout: 252 seconds]
05:43:26chrismeller5 is now known as chrismeller
05:44:22dumbgoy_ quits [Ping timeout: 265 seconds]
06:33:52BlueMaxima quits [Read error: Connection reset by peer]
06:49:06chrismeller0 (chrismeller) joins
06:51:33chrismeller quits [Ping timeout: 265 seconds]
06:51:33chrismeller0 is now known as chrismeller
07:05:44yzqzss joins
07:05:53yzqzss quits [Remote host closed the connection]
07:13:41<yzqzss|m>yts98: STWP (a Chinese group of web archivists) is archiving Banciyuan. https://t.me/saveweb/150
07:22:30<yzqzss|m>We (stwp) have scraped 11M users, 42M items, 0.3M groups and 2.4M circles metadata from its API
07:23:37<flashfire42>yzqzss|m in what format? WARC? TXT? Raw files?
07:29:02justmolamola quits [Ping timeout: 252 seconds]
07:32:28<BigBrain>channel name for stackoverflow decided yet?
07:59:41<yzqzss|m><flashfire42> "yzqzss in what format? WARC? TXT..." <- We can't use WARC to store this website's content because there is too much of it, and web archiving is slow for it, since much of the content requires JS to load.
07:59:41<yzqzss|m>So, we used MongoDB to store the metadata from their webapi/appapi. We have already (still in progress...) collected over 500GB of metadata, including 11M users, 42M items, 0.4M groups, and 2.4M circles.
08:02:20<yzqzss|m>The most challenging aspect would be how to download the media resource URLs retrieved from the metadata. Yesterday, we used a simple grep to extract 129GB of urls.txt (unfiltered and with duplicate entries) from our current 500GB metadata. We might seek help from here when our urls.txt is ready to download. :)
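The grep-then-deduplicate step described above could look roughly like the following. This is only a sketch under stated assumptions, not STWP's actual tooling: the URL pattern is deliberately loose, the function name `extract_urls` is made up for illustration, and it assumes URLs appear unescaped in the text (JSON exports that write `\/` for slashes would need unescaping first).

```python
import re
from typing import Iterable, Iterator

# Assumption: a deliberately loose pattern; real media URLs may need a stricter one.
URL_RE = re.compile(r"https?://[^\s\"'<>\\]+")


def extract_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct URL once, in order of first appearance."""
    seen: set[str] = set()
    for line in lines:
        for url in URL_RE.findall(line):
            if url not in seen:
                seen.add(url)
                yield url
```

The in-memory `seen` set is fine for a sample, but it would not fit a list on the scale of the 129GB urls.txt mentioned above; at that size an on-disk sort and unique pass is the more realistic way to deduplicate.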
08:03:14<flashfire42>Urls.txt could be queued up with the warrior maybe
08:21:20tipi joins
08:21:29tipi leaves
08:26:54Island quits [Read error: Connection reset by peer]
08:33:35<fireonlive>BigBrain: not yet
08:37:35zhongfu_ (zhongfu) joins
08:37:59zhongfu quits [Read error: Connection reset by peer]
08:42:19justmolamola joins
08:58:33BearFortress quits [Client Quit]
09:15:55lunik173 quits [Quit: Ping timeout (120 seconds)]
09:17:35lunik173 joins
09:17:41@Sanqui quits [Read error: Connection reset by peer]
09:18:09Sanqui joins
09:18:11Sanqui quits [Changing host]
09:18:11Sanqui (Sanqui) joins
09:18:11@ChanServ sets mode: +o Sanqui
09:23:55BearFortress joins
09:38:38decky_e quits [Client Quit]
09:39:32Ruthalas5 quits [Ping timeout: 258 seconds]
09:41:18Ruthalas5 (Ruthalas) joins
09:48:28decky_e (decky_e) joins
10:00:01railen63 quits [Remote host closed the connection]
10:00:18railen63 joins
10:01:02Matthww1 quits [Read error: Connection reset by peer]
10:01:20Matthww1 joins
10:07:57<flashfire42>https://transfer.archivete.am/ZwBIV/Starting%20CheckIP%20for%20Item.txt Got this on the DPReview warrior project, if someone can fix it or tell me to stop being an idiot and running that project. arkiver any idea?
10:15:01jacksonchen666 quits [Ping timeout: 245 seconds]
10:34:29<yts98>yzqzss|m: thanks for informing. I'll take a look at saveweb/fourdimensions-archive and the TG group
10:39:37jacksonchen666 (jacksonchen666) joins
10:49:02railen69 joins
10:51:59railen63 quits [Ping timeout: 258 seconds]
11:01:02<h2ibot>JAABot edited CurrentWarriorProject (+4): https://wiki.archiveteam.org/?diff=49957&oldid=49931
11:24:53<@arkiver>yts98: can you please post what you have up to now for Banciyuan?
11:25:25<@arkiver>i'd like to get a project started for us - i'd like to check your code similarly as we've done with lineblog
11:25:51<flashfire42>https://transfer.archivete.am/ZwBIV/Starting%20CheckIP%20for%20Item.txt Got this on the DPReview warrior project, if someone can fix it or tell me to stop being an idiot and running that project. arkiver any idea? (This is something I posted an hour or so ago, I just wanted to check. I am aware IRC sometimes takes ages to get a reply)
11:26:37<@arkiver>flashfire42: this is unfortunately expected right now :/ dpreview needs some love which it will get
11:28:08<flashfire42>Duly noted. I will avoid it for the minute and occasionally ask you
11:28:20<@arkiver>thank you!
11:29:55<@arkiver>yzqzss|m: you say 'web archiving is slow for it' - what does that mean? the site is slow? can the site not handle a lot?
11:34:04<flashfire42>is issuu ok to run arkiver ?
11:34:19<@arkiver>flashfire42: no :P
11:34:26<@arkiver>wait let me reorder this
11:35:02<flashfire42>Sorry I am just trying out all the different projects that are still there looking for stuff to run
11:35:50<flashfire42>You are already aware several projects dont work on the warrior anyway
11:37:17<@arkiver>flashfire42: i moved things around!
11:37:20<@arkiver>how does this look?
11:37:29<@arkiver>no need for sorry :)
11:37:56<flashfire42>How do you mean you moved things around?
11:38:16<flashfire42>Ok now I see it
11:38:19<flashfire42>thats better
11:40:23sepro quits [Ping timeout: 252 seconds]
11:53:37decky_e quits [Remote host closed the connection]
11:53:39<@arkiver>let's make a channel for banciyuan, any ideas?
11:54:31<flashfire42>#YuanInTheBank
11:54:51<flashfire42>it kinda sucks but I got nothing else
11:55:38<flashfire42>arkiver if you wanna even consider that one jump in there so I can op you cause I am currently babysitting it
11:55:39<@arkiver>yzqzss|m: if you have a list of URLs then yes please let us know!
11:55:50<@arkiver>we'll likely crawl this as well though, since we need the data in WARCs
11:55:51jacksonchen666 quits [Ping timeout: 245 seconds]
11:56:12<h2ibot>Nemo bis edited Miraheze (+318, /* Backups */ update): https://wiki.archiveteam.org/?diff=49958&oldid=49951
11:57:12<h2ibot>Nemo bis edited Miraheze (+77, /* Backups */ numbers): https://wiki.archiveteam.org/?diff=49959&oldid=49958
11:57:33<flashfire42>if #YuanInTheBank is liked can an archiveteam op join it I am babysitting it but going to bed in like half an hour to an hour dont wanna slow down any progress if you like my silly channel name
11:59:21jacksonchen666 (jacksonchen666) joins
12:00:13<h2ibot>JAABot edited CurrentWarriorProject (+1): https://wiki.archiveteam.org/?diff=49960&oldid=49957
12:02:43<@arkiver>flashfire42: i'm not sure about that one
12:02:50<@arkiver>rewby: lineblog is up and running
12:02:55<@arkiver>next one is up soon
12:03:27<@rewby|backup>Ack
12:06:24<@arkiver>rewby: can you please create a target for Banciyuan?
12:06:26<@arkiver>this would be
12:06:30<@arkiver>archiveteam_banciyuan_
12:06:36<@arkiver>banciyuan_
12:06:57<@arkiver>Archive Team 半欑元 (Banciyuan):
12:07:03<@rewby|backup>Is there a channel i should watch?
12:07:03<@arkiver>the tracker is up
12:07:06<@arkiver>not yet
12:07:15<@arkiver>i'll ping you when we have a channel
12:22:43<@rewby>arkiver: Target up
12:26:09<@arkiver>rewby: thank you!
12:28:14railen63 joins
12:30:30railen69 quits [Ping timeout: 258 seconds]
12:32:36c joins
12:39:20<h2ibot>Nemo bis edited Miraheze (+237, update): https://wiki.archiveteam.org/?diff=49961&oldid=49959
12:46:02railen69 joins
12:48:44railen63 quits [Ping timeout: 265 seconds]
12:50:40c quits [Ping timeout: 265 seconds]
12:54:53<yzqzss|m><arkiver> "yzqzss: if you have a list of..." <- Yes we have. They are stored in the database and take some time to export, please wait.
12:55:30<yzqzss|m>But I worry that archiveteam and we (stwp) may be doing duplicate work...🫠
12:56:07<@arkiver>yzqzss|m: we would partially be doing that yes
12:56:15<@arkiver>but do you see any slowness from the site if you archive fast?
12:56:16jacksonchen666 quits [Client Quit]
12:56:40<@arkiver>however, we have to get this data in WARCs with HTML, playable in services like the Wayback Machine.
12:57:52<@arkiver>yzqzss|m: if you have not seen signs that the site cannot handle much, then we'll likely move forward with this. if you do see signs the site cannot handle this, let's see further what we should do here
12:59:49<@arkiver>yzqzss|m: perhaps you have an idea for a channel name?
13:11:11<yzqzss|m><arkiver> "but do you see any slowness from..." <- We haven't tried the web pages, but at least its webapi/appapi doesn't have any WAF. Be aware though that they have staff monitoring gateway traffic.
13:11:47<@arkiver>yzqzss|m: sounds good in that case
13:11:55<@arkiver>do you plan to stick around at Archive Team for this?
13:12:04AlbertLarsan68 (AlbertLarsan68) joins
13:13:51<@arkiver>so Banciyuan literally means "Half Dimension" - maybe there is an English or Chinese pun on that that we can use for a channel name
13:15:19<yzqzss|m><arkiver> "yzqzss: we would partially be..." <- I think you guys want to archive banciyuan via webpage. Is your goal to archive the entire site?
13:16:46<@arkiver>yzqzss|m: yes, including images/videos/etc. i'm hoping it'll not be too big. if it turns out to be very big, we'll see about perhaps dropping videos or lowering quality. we'd likely also include any API stuff we can find
13:18:32c joins
13:25:12AlbertLarsan68 quits [Client Quit]
13:25:19c quits [Remote host closed the connection]
13:26:32Arcorann quits [Ping timeout: 252 seconds]
13:27:02AlbertLarsan68 (AlbertLarsan68) joins
13:31:24<yzqzss|m>https://transfer.archivete.am/zj9Jf/banciyuan_user_ids.list.txt
13:32:22<yzqzss|m>wc -l --> 10601351
13:33:42<yzqzss|m>bcy.net/u/{user_id}
13:33:42<@arkiver>yzqzss|m: nice! how complete do you think the list is?
13:34:46<@arkiver>interesting, i see some long IDs and some short IDs. could the short IDs be from an earlier phase in which they were created sequentially?
13:35:45<yzqzss|m>arkiver: Yes!
13:37:56sepro (sepro) joins
13:38:11<@rewby>If I know arkiver at all, the next thought is "we'll iterate over all of them"
13:38:46<@arkiver>rewby: yes for the sequential ones, but some (perhaps later IDs?) seem to be not sequential
13:38:57<@arkiver>we'll queue up all sequential ones and of course keep doing discovery
13:41:59<yzqzss|m>Users from the early years have short sequential IDs. Then one day, banciyuan started assigning random int64 long IDs to new users.
13:43:11<yzqzss|m><arkiver> "yzqzss: nice! how complete do..." <- 95%+, maybe.
13:43:41<@arkiver>very nice
13:45:00<AlbertLarsan68>I have a question about the Warrior tracker's webpage: Is it normal that the top-list (the left table) has a clicky-mouse-pointer, but nothing happens when I click?
13:46:39railen64 joins
13:46:44AmAnd0A quits [Ping timeout: 265 seconds]
13:47:19AmAnd0A joins
13:48:08<imer>highest sequential id seems to be 4.086.616, one after that is 5.287.119 -> 6.513.318 -> 6.704.473 -> 7.079.273
13:48:22<imer>https://transfer.archivete.am/WJcDX/banciyuan_user_ids.list_sorted.txt if anyone else is curious
13:49:06<imer>highest seems to be 4.503.256.006.921.376 with 3 outliers in the 7.168.705.518.750.483.509 range
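The split between the early sequential IDs and the later random ones can be sketched in a few lines. This is a minimal illustration, not the project's code: the cutoff of 4,086,616 is the figure imer quotes above, the profile URL follows the `bcy.net/u/{user_id}` pattern from yzqzss|m, starting the sequential range at 1 is an assumption, and all function names are hypothetical.

```python
from typing import Iterable, Iterator

# Assumption: highest ID in the dense sequential range, as quoted by imer above.
SEQUENTIAL_MAX = 4_086_616


def split_ids(ids: Iterable[int], cutoff: int = SEQUENTIAL_MAX) -> tuple[list[int], list[int]]:
    """Separate discovered IDs into the sequential range and everything above it."""
    short, long_ = [], []
    for user_id in ids:
        (short if user_id <= cutoff else long_).append(user_id)
    return sorted(short), sorted(long_)


def user_urls(ids: Iterable[int]) -> Iterator[str]:
    """Build profile URLs following the bcy.net/u/{user_id} pattern."""
    for user_id in ids:
        yield f"https://bcy.net/u/{user_id}"


def sequential_candidates(cutoff: int = SEQUENTIAL_MAX) -> range:
    """Every ID in the sequential range, whether or not it was discovered.

    Assumption: the range starts at 1.
    """
    return range(1, cutoff + 1)
```

IDs at or below the cutoff can simply be iterated, while the long random IDs can only come from discovery, which matches the plan to queue all sequential ones and keep doing discovery for the rest.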
13:49:28railen69 quits [Ping timeout: 258 seconds]
13:54:06<@arkiver>imer: is this data from the list of yzqzss|m ?
13:58:40AmAnd0A quits [Read error: Connection reset by peer]
13:58:59AmAnd0A joins
14:06:24<yzqzss|m>https://transfer.archivete.am/bxYDX/banciyuan_uids_from_ranking.txt
14:06:24<yzqzss|m>https://transfer.archivete.am/jUaAQ/banciyuan_item_ids_from_ranking.txt
14:06:24<yzqzss|m>---
14:06:24<yzqzss|m>These users and items are listed on the leaderboard. archive them first?
14:06:24<yzqzss|m>https://bcy.net/{illust|coser|novel}/toppost100
14:11:44railen63 joins
14:13:48railen64 quits [Ping timeout: 265 seconds]
14:19:02<imer>arkiver: yes, just sorted it out of curiosity
14:47:08railen69 joins
14:50:03railen63 quits [Ping timeout: 265 seconds]
14:50:11toss (toss) joins
15:00:24<nstrom|m>How about #bcbye as channel name? /shrug
15:01:06<@arkiver>it sounds pretty nice
15:03:48dumbgoy_ joins
15:06:28<BigBrain>bancigone
15:10:28<mattx433>begoneyuan?
15:11:28<yzqzss|m>We just extracted the picture URLs in all current items. After deduplication, there are about 67054075 pictures. (Not the final result, we are still crawling, and new items are still added to the DB)
15:12:12<@arkiver>yzqzss|m: how large do you think your final dump will be?
15:17:31railen64 joins
15:20:30railen69 quits [Ping timeout: 265 seconds]
15:23:52railen69 joins
15:26:10<yzqzss|m>IDK, but it should be tens~hundreds of TiB. And we haven't counted the number of videos yet.
15:26:27railen64 quits [Ping timeout: 258 seconds]
15:40:16<yzqzss|m>Also, I personally don't recommend archiving videos, downloading them in bulk will draw the attention of employees. And most of the videos are also available on BiliBili (www.bilibili.com)
15:45:48Minkafighter quits [Client Quit]
15:46:04Minkafighter joins
15:53:28<yzqzss|m>* A few years ago, ByteDance (parent company of TikTok) acquired banciyuan. Although the banciyuan team is gradually abandoning the maintenance of the website, ByteDance's traffic audit team is still there.
15:53:28<yzqzss|m>* The audit team banned our crawler account a few days ago. We have tried to contact them, but they are difficult to negotiate with :(
15:54:04railen64 joins
15:55:14sonick quits [Quit: Connection closed for inactivity]
15:55:36dumbgoy__ joins
15:57:14railen69 quits [Ping timeout: 265 seconds]
15:58:15Dango360 (Dango360) joins
15:59:39dumbgoy_ quits [Ping timeout: 265 seconds]
16:04:32<yts98>arkiver: I'm still working on handling image URL rules, and my code will cover webpages, webapis, images, videos, danmakus
16:07:09Aoede_ is now known as Aoede
16:07:34justmolamola quits [Remote host closed the connection]
16:10:23driib quits [Quit: The Lounge - https://thelounge.chat]
16:10:44driib (driib) joins
16:15:33<AlbertLarsan68>Is the team interested in archiving a French Torrent site (type TPB)?
16:15:33<AlbertLarsan68>French legislation is chasing them; they have to change their domain constantly to avoid DNS blocking by all major ISPs.
16:16:27<yzqzss|m>yts98: Do you have Telegram? Can you join https://t.me/saveweb_projects/319 to discuss about banciyuan archiving with us?
16:16:45driib quits [Remote host closed the connection]
16:17:04driib (driib) joins
16:17:35<yts98>yzqzss|m: ok, as my native language is Traditional Chinese
16:24:13<yzqzss|m>👍️👍️
16:30:07driib quits [Remote host closed the connection]
16:30:33driib (driib) joins
16:31:20toss quits [Ping timeout: 252 seconds]
16:33:57<cronfox>yts98: ohhhhh
16:35:06<yts98>discussing with STWP. I'll update findings here and on wiki
16:43:47driib quits [Remote host closed the connection]
16:44:07driib (driib) joins
16:44:53VerifiedJ quits [Quit: The Lounge - https://thelounge.chat]
16:46:29VerifiedJ (VerifiedJ) joins
16:53:00toss (toss) joins
16:54:41toss_ (toss) joins
16:56:04AmAnd0A quits [Read error: Connection reset by peer]
16:56:14AmAnd0A joins
16:57:20that_lurker quits [Client Quit]
16:57:29that_lurker (that_lurker) joins
16:57:36AmAnd0A quits [Read error: Connection reset by peer]
16:57:44toss quits [Ping timeout: 252 seconds]
16:57:52AmAnd0A joins
16:58:10driib quits [Remote host closed the connection]
16:58:29driib (driib) joins
17:12:46driib quits [Remote host closed the connection]
17:13:06driib (driib) joins
17:13:42toss (toss) joins
17:17:48BigBrain quits [Remote host closed the connection]
17:18:05BigBrain (bigbrain) joins
17:18:05toss_ quits [Ping timeout: 252 seconds]
17:26:00driib quits [Remote host closed the connection]
17:26:21driib (driib) joins
17:27:35@rewby quits [Ping timeout: 258 seconds]
17:34:55rewby (rewby) joins
17:34:55@ChanServ sets mode: +o rewby
17:35:41decky_e (decky_e) joins
17:39:25driib quits [Remote host closed the connection]
17:39:45driib (driib) joins
17:44:36decky_e quits [Read error: Connection reset by peer]
17:44:58decky_e (decky_e) joins
17:50:31driib quits [Remote host closed the connection]
17:50:50driib (driib) joins
17:52:22<kiska>I have a suggestion for channel name wuciyuan? :D
17:52:34<kiska>Or simplified Chinese 无欑元
17:52:49<kiska>Which translates to dimensionless
18:00:09toss leaves
18:02:03driib quits [Remote host closed the connection]
18:02:22driib (driib) joins
18:08:44driib quits [Remote host closed the connection]
18:09:03driib (driib) joins
18:13:26eroc1990 quits [Quit: The Lounge - https://thelounge.chat]
18:14:30_Dango360 quits [Quit: Leaving]
18:14:37Dango360 quits [Client Quit]
18:14:46Dango360 (Dango360) joins
18:14:57eroc1990 (eroc1990) joins
18:20:58<yts98>kiska: now there are 二欑元 (2), Banciyuan (1/2), saveweb/fourdimensions-archive (4), wuciyuan (null). lol
18:22:28<yts98>but 五欑元 (five dimensional) is also pronounced wuciyuan
18:23:24driib quits [Remote host closed the connection]
18:23:44driib (driib) joins
18:24:38decky_e quits [Ping timeout: 252 seconds]
18:25:58decky_e (decky_e) joins
18:27:05AmAnd0A quits [Read error: Connection reset by peer]
18:27:10AmAnd0A joins
18:28:14AmAnd0A quits [Read error: Connection reset by peer]
18:28:55AmAnd0A joins
18:30:26AmAnd0A quits [Read error: Connection reset by peer]
18:30:43AmAnd0A joins
18:34:38driib quits [Read error: Connection reset by peer]
18:34:57driib (driib) joins
18:45:59Megame (Megame) joins
18:46:26tzt quits [Remote host closed the connection]
18:46:48tzt (tzt) joins
18:46:53imer quits [Ping timeout: 265 seconds]
18:48:03railen64 quits [Remote host closed the connection]
18:48:37driib quits [Remote host closed the connection]
18:48:57driib (driib) joins
18:51:01railen63 joins
18:54:44imer (imer) joins
18:57:15<kiska>You could do 四欑元 for siciyuan and if you pronounce si in Cantonese :D
18:57:50<kiska>Could be funny saying dead dimension :D
18:59:38<BigBrain>do irc channel names only allow ascii?
18:59:55<kiska>We do have #// for urls so perhaps?
19:00:46<kiska>05:00:23 [479]#无欑元 Illegal channel name
19:00:49<kiska>Sadness!
19:00:58<BigBrain>四欑元 worked
19:01:15<kiska>Yeah I tried that one first
19:01:38<BigBrain>maybe utf-8?
19:02:02<kiska>Traditional Chinese dimensionless is 焑欑元
19:02:30<BigBrain>works
19:03:15<kiska>Perhaps I should complain about the other one not working :D
19:03:34<kiska>It's either The Lounge not liking it or irc
19:04:07<BigBrain>weechat as well
19:04:12<BigBrain>probably irc
19:04:41driib quits [Remote host closed the connection]
19:05:00driib (driib) joins
19:06:41AlbertLarsan68 quits [Client Quit]
19:16:22tzt quits [Ping timeout: 265 seconds]
19:18:44driib quits [Remote host closed the connection]
19:19:02driib (driib) joins
19:23:12tzt (tzt) joins
19:26:29<@arkiver>yts98: i'd like to start it asap - perhaps you can post online what you have now? it'll likely be greatly refactored anyway, just like with lineblog
19:27:58<@arkiver>yts98: the most useful information here is what is posted on the wiki - i've started a bit on a project now and would like to incorporate whatever you have in your code as well, to get this started as soon as possible.
19:28:06driib quits [Remote host closed the connection]
19:28:25driib (driib) joins
19:29:06<@arkiver>kiska: yts98: BigBrain: let's go with something in latin script
19:30:23<kiska>I'll part the 2 channels I made then
19:31:13<@arkiver>let's go with #wuciyuan
19:34:09driib quits [Remote host closed the connection]
19:34:28decky_e quits [Ping timeout: 258 seconds]
19:34:29driib (driib) joins
19:34:58decky_e (decky_e) joins
19:35:30Twisty joins
19:35:40<@arkiver>yzqzss|m: i mean - is this what you plan to archive, or would this be the total size given unlimited space to archive?
19:38:08AmAnd0A quits [Read error: Connection reset by peer]
19:38:14AmAnd0A joins
19:39:19AmAnd0A quits [Read error: Connection reset by peer]
19:39:20<Twisty>arkiver Regarding Vine, I don't know about coding or how it works, it's a thing that just came to mind since I've been an avid Vine consumer back in 2015 and while there were and still are compilations on YT, I don't think there has ever been an _actual_ archive made of them besides the official Twitter one which more or less shut down in 2019.
19:39:47Island joins
19:39:56AmAnd0A joins
19:40:31<BigBrain>Twisty: anywhere to scrape for ids? or just bruteforce slowly?
19:40:59<Twisty>I'd guess bruteforcing, here is a link for example https://vine.co/v/5eX56hWvxBU
19:42:15<Twisty>https://vine.co/u/923279753992089600 There are also links to profiles that look like this
19:42:47AmAnd0A quits [Read error: Connection reset by peer]
19:43:02AmAnd0A joins
19:44:32<h2ibot>Arkiver uploaded File:Banciyuan-icon.png: https://wiki.archiveteam.org/?title=File%3ABanciyuan-icon.png
19:46:40driib quits [Remote host closed the connection]
19:46:58driib (driib) joins
19:50:22nix78 joins
19:58:32carnage joins
19:59:49driib quits [Remote host closed the connection]
20:00:08driib (driib) joins
20:00:40imer quits [Client Quit]
20:02:42imer (imer) joins
20:08:45andrew quits [Quit: ]
20:11:16driib quits [Remote host closed the connection]
20:11:34driib (driib) joins
20:12:03carnage56 joins
20:12:46carnage56 quits [Remote host closed the connection]
20:13:13BigBrain quits [Remote host closed the connection]
20:13:33BigBrain (bigbrain) joins
20:15:11<nicolas17>Twisty: those vine IDs seem too big to reasonably bruteforce
20:15:36<masterX244>seems like the WARCrippers need to be fired up again...
20:15:42<Twisty>:C
20:15:44<masterX244>if we want to hunt down links
20:15:49carnage quits [Ping timeout: 265 seconds]
20:19:42<Twisty>I guess that makes sense, I just hope it'd be possible to archive them. I don't think Vine is as important as Reddit or Imgur, but considering the impact it had and the content created throughout the mid 2010s, it would be worth taking a look at possible archival options.
20:23:17driib quits [Remote host closed the connection]
20:23:38driib (driib) joins
20:24:29Megame quits [Client Quit]
20:28:39<masterX244>too bad that common-crawl is still off-limits for processing after we got amazon to wave a white flag once we sustained 120 Gbit/s grepping data
20:31:12<@arkiver>note to anyone who wants to start running bcy.net - your IP will be archived in the page source
20:36:25TTN joins
20:36:50driib quits [Remote host closed the connection]
20:37:10driib (driib) joins
20:37:57<fireonlive>:o
20:41:39Twisty quits [Remote host closed the connection]
20:41:49<kiska>Thats problematic :D
20:45:55<flashfire42>arkiver well that fucking sucks. Will it be a warrior project?
20:50:40driib quits [Remote host closed the connection]
20:50:58driib (driib) joins
20:58:46<h2ibot>Nemo bis edited Wikimedia Commons (+287, /* Other dumps */ we're not quite planning to…): https://wiki.archiveteam.org/?diff=49963&oldid=48033
21:00:13andrew (andrew) joins
21:01:48Unholy2361 quits [Quit: The Lounge - https://thelounge.chat]
21:02:13tzt quits [Ping timeout: 265 seconds]
21:02:19Unholy2361 (Unholy2361) joins
21:02:58driib quits [Remote host closed the connection]
21:03:17driib (driib) joins
21:03:24tzt (tzt) joins
21:04:00<rktk>Vine??
21:04:07<rktk>Are those links even still valid? I can't imagine
21:05:10<fireonlive>vine.co hosts an archive
21:05:28<fireonlive>they used to have 'top vines' etc but have since killed that
21:06:09<fireonlive>searching site:vine.co reveals usernames like https://vine.co/TheGabbieShow/
21:07:23<fireonlive>direct links seem to be 'hard to get' from there but here's a random ID: https://vine.co/v/iU3uhlJrzBm
21:07:38<fireonlive>they're quite long
21:08:04<rktk>oh wow, Vine.co is still online
21:08:21driib quits [Remote host closed the connection]
21:08:26<flashfire42>Could have it be like the #down-the-tube channel where individual vids can be queued or whole lists like #telegrab
21:08:35<rktk>but Vine is owned by Twitter, and with Elon's shitstorm, it could in theory be dumped eh
21:08:38<nicolas17>yep there's no way we can bruteforce 52036560683837093888 IDs :P
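The figure nicolas17 quotes matches an 11-character ID over a 62-character alphabet. A small sketch of that arithmetic follows; the alphabet (digits plus upper- and lower-case letters) and the fixed length of 11 are assumptions inferred from the two example links above, not documented facts about Vine, and the request rate is purely hypothetical.

```python
import string

# Assumption: IDs are 11 characters drawn from digits plus upper- and
# lower-case letters, as the two example links suggest.
ALPHABET = string.digits + string.ascii_letters
ID_LENGTH = 11


def keyspace(alphabet_size: int = len(ALPHABET), length: int = ID_LENGTH) -> int:
    """Number of possible IDs of the given length over the given alphabet."""
    return alphabet_size ** length


def years_to_enumerate(requests_per_second: float) -> float:
    """Years needed to try every ID once at a constant (hypothetical) rate."""
    seconds = keyspace() / requests_per_second
    return seconds / (365 * 24 * 60 * 60)


print(keyspace())  # 52036560683837093888
```

Even at a hypothetical one million requests per second, `years_to_enumerate(1e6)` comes out at well over a million years, which is why discovery from existing links is the only realistic route.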
21:08:38driib (driib) joins
21:08:44<nicolas17>rktk: shh
21:08:45<rktk>"Why are we hosting all these old videos nobody is watching anymore"
21:08:50<nicolas17>elon probably doesn't remember vine exists
21:08:53<rktk>^
21:09:05tzt quits [Ping timeout: 252 seconds]
21:09:10<rktk>I'll put some old Vine dumps I grabbed for a few channels I liked, this was before they put it in archive state
21:09:11<fireonlive>unless iU3uhlJrzBm is some kinda encoded thing :p
21:09:17<rktk>that looks like a Youtube ID
21:09:29<fireonlive>i think elon said he wanted to bring vine back.. but he's said a lot of things
21:11:23<fireonlive>i randomly found this; unsure if it's anything: https://gist.github.com/bradserbu/4d3a10b54cb8895e6ca7
21:11:41<nicolas17>wth
21:11:55<fireonlive>might be something else
21:12:07<rktk>Actually it seems my vine archives are a bit useless because they have JSON video URL pointers to the old URLs
21:12:12<fireonlive>oh it says vine.co in the comments
21:12:15<rktk>and it seems most if not all of these videos still exist on vine
21:12:51<fireonlive>vine.co still has some sort of takedown process (documented, anyways)
21:13:25<fireonlive>https://help.twitter.com/en/using-twitter/vine-faqs "Yes. If you wish to delete your Vine account, you can let us know by emailing us at vinehelp@twitter.com with a link to your Vine profile page (i.e., https://vine.co/MyUserName)."
21:13:37<fireonlive>idk what, if any, sort of verification process they'd use there
21:13:55<@arkiver>let's move vine chat to #vinewhine
21:16:26<rktk>Done sorry.
21:20:21driib quits [Remote host closed the connection]
21:20:40driib (driib) joins
21:20:52Ruthalas5 quits [Client Quit]
21:21:11Ruthalas5 (Ruthalas) joins
21:30:02driib quits [Remote host closed the connection]
21:30:22driib (driib) joins
21:30:51<h2ibot>Nemo bis edited Wikimedia Commons (+221, update query): https://wiki.archiveteam.org/?diff=49964&oldid=49963
21:33:07hitgrr8 quits [Client Quit]
21:35:19driib quits [Remote host closed the connection]
21:35:39driib (driib) joins
21:41:08driib quits [Remote host closed the connection]
21:41:24driib (driib) joins
21:47:18driib quits [Read error: Connection reset by peer]
21:47:38driib (driib) joins
21:53:32driib quits [Remote host closed the connection]
21:53:52driib (driib) joins
22:00:57<h2ibot>JAABot edited CurrentWarriorProject (-1): https://wiki.archiveteam.org/?diff=49965&oldid=49960
22:04:29Misty joins
22:05:09driib quits [Remote host closed the connection]
22:05:28driib (driib) joins
22:07:03<Misty>@arkiver @nicolas17 can IA store the pictures we are downloading from banciyuan?
22:07:24<Misty>should be over 200TB & about 0.6 billion
22:08:35<nicolas17>I haven't looked into banciyuan at all yet
22:08:50<Misty>oh, i see you in the chat history
22:09:09<nicolas17>yeah but I didn't pay attention to the details of that project :D
22:09:39<nicolas17>the imgur project has archived 590TB so far
22:09:40<Misty>well i'm just asking, we will go on with or without IA, but getting any kind of support (like storage) will always be good
22:10:05<Misty>yeah, but imgur is pretty famous lo
22:10:07<Misty>yeah, but imgur is pretty famous lol
22:10:27<nicolas17>I have no contact with IA to say if that's okay or not, so yeah wait for ark1ver
22:11:07driib quits [Remote host closed the connection]
22:11:25driib (driib) joins
22:11:28<Misty>nicolas17 thanks, do you know anyone else who has contact with IA?
22:20:17<yts98>just uploaded untested code to github.com/yts98/banciyuan-grab . arkiver yzqzss|m can take a look.
22:21:51<Misty>oh @yts98 got you online :)
22:22:17<Misty>so can IA host the pictures we downloaded?
22:23:00driib quits [Remote host closed the connection]
22:23:19driib (driib) joins
22:24:40decky_e quits [Ping timeout: 258 seconds]
22:25:04nix78 quits [Remote host closed the connection]
22:25:09decky_e (decky_e) joins
22:25:57<@arkiver>Misty: depends on how much
22:26:13<@arkiver>200 TB?
22:26:20<@arkiver>no, that will likely not happen
22:26:24skyrocket quits [Client Quit]
22:26:58<Misty>arkiver thanks, how large a collection would they usually accept?
22:27:14<@arkiver>Misty: banciyuan has multiple sizes of images, smaller ones and "original" quality images. which sizes are you talking about?
22:27:59<Misty>i mean total size, and we are backing up the "original" images
22:28:00<@arkiver>Misty: yts98: yzqzss|m: let's move to #wuciyuan
22:28:32fishingforsoup joins
22:30:02<h2ibot>Yts98 edited Banciyuan (+2199, Sync findings with Save The Web Project): https://wiki.archiveteam.org/?diff=49966&oldid=49954
22:35:58skyrocket joins
22:36:27driib quits [Remote host closed the connection]
22:36:28decky_e quits [Ping timeout: 265 seconds]
22:36:45driib (driib) joins
22:36:50decky_e (decky_e) joins
22:37:09skyrocket quits [Client Quit]
22:42:09sonick (sonick) joins
22:48:03decky_e quits [Ping timeout: 258 seconds]
22:49:54decky_e (decky_e) joins
22:50:46driib quits [Remote host closed the connection]
22:51:05driib (driib) joins
22:52:05<h2ibot>Yts98 edited Banciyuan (+242, Update IRC and describe image URLs): https://wiki.archiveteam.org/?diff=49967&oldid=49966
22:55:47decky_e quits [Ping timeout: 252 seconds]
22:56:27decky_e (decky_e) joins
23:02:25decky_e quits [Remote host closed the connection]
23:04:13driib quits [Remote host closed the connection]
23:04:32driib (driib) joins
23:06:14Ruthalas5 quits [Ping timeout: 252 seconds]
23:08:48Ruthalas5 (Ruthalas) joins
23:17:04Ruthalas5 quits [Ping timeout: 265 seconds]
23:19:26AmAnd0A quits [Ping timeout: 252 seconds]
23:19:34driib quits [Remote host closed the connection]
23:19:43AmAnd0A joins
23:19:53driib (driib) joins
23:21:13Ruthalas5 (Ruthalas) joins
23:22:55Ruthalas5 quits [Client Quit]
23:27:19rocketdive joins
23:28:41AmAnd0A quits [Ping timeout: 258 seconds]
23:28:46<nicolas17>a friend wants to more systematically archive stuff from Apple's "update catalogs", such as the macOS InstallAssistant.pkg
23:30:22<nicolas17>and he asked about deduplication, since when a beta version is published as both "developer beta" and "public beta", you end up with two different URLs with the same 12GB file
23:30:46<nicolas17>so I suggested using WARCs
23:31:17<nicolas17>get deduplication + preserve headers (although they don't matter much in practice) + they work on the WBM
23:32:00<nicolas17>except... only some people are allowed to upload WARCs such that they get indexed by WBM, so that's the biggest advantage killed
23:34:16driib quits [Remote host closed the connection]
23:34:21<@arkiver>nicolas17: it would be awesome to have those
23:34:26<@arkiver>can't we use archivebot for that?
23:34:36driib (driib) joins
23:35:16<nicolas17>would archivebot dedup those? 12GB seems pretty big to store twice...
23:35:29<@JAA>AB does not dedupe anything.
23:35:42<@arkiver>well
23:35:49<@arkiver>question is how often would it get a new 12 GB file?
23:35:52<@JAA>This would fit perfectly into my planned software binaries thing, but ETA unknown.
23:35:59<@arkiver>once a month? i wouldn't worry about deduplication
23:36:06<@arkiver>once per hour? maybe deduplication would be nice
23:36:36<nicolas17>https://archive.org/details/sucatalog_012-71768 someone manually uploaded these a while ago (many of which were already deleted from Apple CDN!)
23:36:55<nicolas17>and in that example, the same file is stored 4 times
23:37:07<nicolas17>{public beta, dev beta} x {http, https}
23:37:19<flashfire42>Hash the entries with SHA-256; if two entries match, flag them for manual review before starting deduplication. Though that could set a dangerous precedent
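[Editor's note: a minimal Python sketch of the hash-and-flag idea suggested above, assuming the files have already been downloaded locally. The function names and the URL-to-path mapping are hypothetical, not part of any ArchiveTeam tool. It only reports candidate duplicates for review and deletes nothing.]

```python
import hashlib
from collections import defaultdict


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a 12 GB package is never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_candidates(paths_by_url):
    """Group URLs by the SHA-256 of their downloaded file.

    paths_by_url is a hypothetical mapping of source URL -> local file path.
    Returns {digest: [urls]} only for digests shared by two or more URLs,
    i.e. the entries that would be flagged for manual review.
    """
    urls_by_digest = defaultdict(list)
    for url, path in paths_by_url.items():
        urls_by_digest[sha256_of_file(path)].append(url)
    return {
        digest: sorted(urls)
        for digest, urls in urls_by_digest.items()
        if len(urls) > 1
    }
```

[Every source URL stays in the report, which fits the later point in this log that both URLs are important even when the payload is identical.]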
23:37:34<@arkiver>you don't have to get them with both http and https, one of the two is fine
23:37:40<@arkiver>the two URLs are important though
23:37:49<@arkiver>nicolas17: how often are these released?
23:38:38<nicolas17>I think there are multiple 'catalog' XML files, and some link to the pkg over http and some over https, so that's where that dup came from: wget blindly saved both
23:40:31rocketdive quits [Client Quit]
23:40:53quackifi joins
23:40:57decky_e joins
23:41:43<nicolas17>arkiver: https://theapplewiki.com/wiki/Beta_Firmware/Mac/13.x this should give a rough idea of frequency
23:43:14<h2ibot>LostArchiver edited Project Sonar (+642, Public no longer has access to data.): https://wiki.archiveteam.org/?diff=49968&oldid=46907
23:44:16dhinakg joins
23:44:58<@arkiver>nicolas17: if avoiding duplication is a major obstacle, then feel free to duplicate the 12 GB here for both URLs
23:45:23<nicolas17>dhinakg: https://hackint.logs.kiska.pw/archiveteam-bs/20230618#c353318
23:45:58<dhinakg>thx
23:46:56<nicolas17>arkiver: well, "throwing the URLs into archivebot" would be the easiest way to get them into WBM, anything else would need messing with WARC-creating tools, doing our own download/upload, and being whitelisted to get said WARCs into a WBM-indexed collection
23:46:58<dhinakg>so i'm not worried about http/https, that's very easy to deduplicate as the links for the "products" (term for each entry in a catalog) are identical and the only difference is http/https
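[Editor's note: a small Python sketch of the http/https case described above, assuming the two links really are identical apart from the scheme. The function names are hypothetical, and keeping the https variant is an arbitrary choice made for this example; the log only says that one of the two is enough.]

```python
from urllib.parse import urlsplit, urlunsplit


def scheme_insensitive_key(url):
    """Return the URL with its scheme and fragment removed and the host lowercased."""
    parts = urlsplit(url)
    return urlunsplit(("", parts.netloc.lower(), parts.path, parts.query, ""))


def drop_scheme_duplicates(urls):
    """Keep one URL per scheme-insensitive key, preferring https when both exist.

    The https preference is an assumption of this sketch, not something
    stated in the discussion.
    """
    kept = {}
    for url in urls:
        key = scheme_insensitive_key(url)
        current = kept.get(key)
        prefer_new = url.startswith("https://") and not (
            current or ""
        ).startswith("https://")
        if current is None or prefer_new:
            kept[key] = url
    return list(kept.values())
```

[This only covers the scheme-level duplicates. The developer beta versus public beta case uses different links for the same files, so it would still need a content comparison such as a hash.]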
23:47:19quackifi quits [Read error: Connection reset by peer]
23:47:26<dhinakg>but commonly apple will reupload a developer beta onto the public beta catalog, with different links, even though it's the exact same files
23:47:35<@arkiver>nicolas17: agreed, AB seems like the best option
23:47:42<@arkiver>if it handles the 12 GB files well
23:48:03<@JAA>It does, on a suitable pipeline and with concurrency 1.
23:48:32quackifi joins
23:48:37<dhinakg>however the metadata on the public beta catalog would be different
23:48:45driib quits [Remote host closed the connection]
23:49:04driib (driib) joins
23:53:05BlueMaxima joins
23:56:57driib quits [Read error: Connection reset by peer]
23:57:15driib (driib) joins
23:58:57dhinakg quits [Remote host closed the connection]