Category talk:Pronunciation

Template

Is there any template for describing pronunciation files? See, for example, template:English spoken article. -- Andrew Krizhanovsky (talk) 22:24, 23 January 2010 (UTC)

No, there is no need for something like this; pronunciation files have a very simple description. See File:En-uk-I can't.ogg and File:Pl-trzysta.ogg for how to write the description correctly. Please remember to name the file properly, including the language ISO code. --Derbeth ^talk 18:53, 25 January 2010 (UTC)

Ok. Thank you. -- Andrew Krizhanovsky (talk) 09:41, 26 January 2010 (UTC)

Simple how-to

Here's how to record a bunch of words on an Ubuntu Linux platform. Use 'synaptic' or 'apt-get' to install the necessary software packages, such as 'alsa-utils' and 'sox'. Compile a list of the words you want to record, one per line, and feed it to the following script on standard input.

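#!/bin/sh
# Reads one word per line from standard input.
lang=sv   # language code, used as the filename prefix
while IFS= read -r word
do
  echo "$word"   # show the word as a prompt
  # record 4 seconds at a sampling rate of 100000 Hz
  arecord -r 100000 -d 4 "$lang-$word.wav"
  # normalize, trim leading and trailing silence (keeping 0.25 s), write Ogg
  sox "$lang-$word.wav" "$lang-$word.ogg" norm vad -p .25 reverse vad -p .25 reverse
done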

Here, "sv" (Swedish) is set as the language, which is used as the filename prefix. "echo" prints the word as a prompt. "arecord" records four seconds of audio at a sampling rate of 100 kHz (the -r option is a sampling rate in Hz, not a bit rate), including any initial and trailing silence, so you don't have to press a key to start and stop the recording. "sox" then normalizes the sound level, trims the initial and trailing silence while keeping a 0.25-second margin, and converts the recorded WAV file to the free and open Ogg Vorbis format. --LA2 (talk) 01:28, 15 March 2013 (UTC)
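When a recording session is interrupted, it can help to know which words still lack a converted file. The following is a minimal sketch under the naming scheme used above ("$lang-$word.ogg"); the helper name pending_words and the file names words.txt and todo.txt are hypothetical, not part of the original script.

```sh
#!/bin/sh
# Print the words from a list (one per line on stdin) that do not yet have
# a converted recording named "$lang-$word.ogg" in directory $dir.
# pending_words is a hypothetical helper, not part of the original script.
pending_words() {
  lang=$1
  dir=$2
  while IFS= read -r word
  do
    [ -n "$word" ] || continue   # skip blank lines
    # print the word only if its .ogg file is missing
    [ -e "$dir/$lang-$word.ogg" ] || printf '%s\n' "$word"
  done
}

# Example (file names are placeholders):
#   pending_words sv . < words.txt > todo.txt
```

The resulting list can then be fed to the recording script in place of the full word list.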

Can it cut a single recording into many words? Infovarius (talk) 13:57, 22 March 2017 (UTC)
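The script above does not do this, but sox's "silence" effect combined with the "newfile" and "restart" pseudo-effects can split a recording at pauses. The sketch below rests on several assumptions: the thresholds are guesses that need tuning for a given microphone and speaker, sox is expected to number the outputs chunk001.wav, chunk002.wav and so on, and the function names split_recording and name_chunks are made up for illustration. The chunks are paired with the word list purely by order, so the words must be spoken in the same order as they are listed.

```sh
#!/bin/sh
# Split one long recording into chunks at pauses. The thresholds (0.1 s of
# sound above 1% to start a chunk, 0.5 s below 1% to end it) are guesses.
# sox is expected to write chunk001.wav, chunk002.wav, ...
split_recording() {
  sox "$1" chunk.wav silence 1 0.1 1% 1 0.5 1% : newfile : restart
}

# Rename chunk001.wav, chunk002.wav, ... to "$lang-$word.wav", pairing the
# chunks in order with the words read from stdin (one per line).
name_chunks() {
  lang=$1
  n=0
  while IFS= read -r word
  do
    n=$((n + 1))
    chunk=$(printf 'chunk%03d.wav' "$n")
    # stop if there are more words than chunks
    [ -e "$chunk" ] || { echo "missing $chunk for $word" >&2; return 1; }
    mv "$chunk" "$lang-$word.wav"
  done
}

# Example (file names are placeholders):
#   split_recording session.wav && name_chunks sv < words.txt
```

It is worth listening to a few of the renamed files before converting them, since a cough or a long pause inside a word shifts every later chunk by one.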

Statistics

The language categories with the most files (counting only files not in subcategories, and only categories with at least 20 files) are:

On September 16, 2015: Dutch (334705), Polish (22423), Ukrainian (16178), German (15048), French (13535), Belarusian (8637), Tamil (8086), Russian (6345), Chinese (4932), Swedish (4344), Hungarian (3623), Czech (3075), Latvian (1944), Jèrriais (1769), Arabic (1519), Armenian (1259), Italian (1075), Latin (655), Farsi (492), Navajo (454), Malagasy (443), Telugu (352), Norwegian (325), Spanish (319), Adyghe (288), Portuguese (264), Esperanto (260), Finnish (257), Welsh (219), Vietnamese (200), Lithuanian (165), Georgian (155), Icelandic (152), Galician (145), English (143), Danish (129), Tagalog (128), Turkish (127), Hebrew (123), Romanian (100), Greek (86), Odia (82), Kölsch (79), Macedonian (78), Thai (76), Nepali (72), Hindi (63), Croatian (59), Bashkir (56), Slovak (51), Irish (51), Devanagari (51), Korean (47), Mbunda (40), Bulgarian (30), Catalan (26), Sanskrit (23), Twi (22). --LA2 (talk) 20:06, 16 September 2015 (UTC)

Why not in subcategories? Infovarius (talk) 13:56, 22 March 2017 (UTC)

Right, subcategories should be included. But just for comparison, here is an updated count of the top-level files, with remarkable improvements in boldface. The table below shows a summary for 5 levels of subcategories. --LA2 (talk) 11:22, 19 May 2017 (UTC)

On May 19, 2017: Dutch (436804), Polish (23515), Russian (17354), Ukrainian (16182), Belarusian (8634), Chinese (4991), Armenian (4546), Swedish (4370), Hungarian (3701), Czech (3078), French (2933), Luxembourgish (2920), Odia (1973), Latvian (1946), Jèrriais (1769), Arabic (1537), Italian (1123), Latin (651), Wolof (586), Hebrew (577), Persian (558), English (543), Esperanto (483), Navajo (455), Malagasy (443), Telugu (386), Spanish (330), Adyghe (327), Norwegian (326), Upper Sorbian (312), Portuguese (294), Finnish (264), Welsh (228), Vietnamese (201), Lithuanian (166), Georgian (157), Galician (155), Icelandic (149), Turkish (134), Danish (132), Tagalog (128), Bengali (116), Romanian (104), Bashkir (90), Thai (89), Greek (81), Kölsch (79), Macedonian (76), Korean (74), Nepali (73), Hindi (67), Croatian (59), Bulgarian (53), Devanagari (51), Slovak (50), Pronunciation of Kannada alphabet (49), Irish (47), Oromo (42), Mbunda (41), Twi (40), Voice spectrograms (31), Catalan (30), Sanskrit (24), Albanian (23), Limburgish (22), Ancient Greek (22).

Date	Dutch	German	English	Polish	Russian	French	Ukrainian	Tamil	Belarusian	Chinese	Armenian	Swedish	Czech	Hungarian	Luxembourgish	Jèrriais	Serbian	Odia	Latvian	Arabic	Italian
May 19, 2017	439887	53605	24848	23948	18032	17608	16190	8719	8639	5967	4634	4605	3885	3722	2921	2310	2147	2048	2039	1695	1393
August 8, 2017	445308	62899	25102	23946	19055	17616	16192	8726	8639	5968	4636	4585	3880	3718	3483	2310	2147	2273	2039	1706	1634
CatScan	nl	de	en	pl	ru	fr	uk	ta	be	zh	hy	sv	cs	hu	lb	nrf	sr	or	lv	ar	it

Languages with fewer than 1000 words

Date	Spanish	Alemannic	Romansh	Esperanto	Persian	Latin	Portuguese	Hebrew	Norwegian	Basque	Arpitan	Telugu	Welsh	Adyghe	Greek	Icelandic	Finnish	Japanese	Romanian	Galician	Macedonian	Slovene	Kazakh	Slovak	Albanian
May 19, 2017	971	828	823	761	726	719	654	577	517	-	420	401	366	347	276	275	269	252	245	214	136	70	54	51	24
August 8, 2017	968	882	823	762	953	719	652	582	564	511	420	401	366	347	276	274	269	253	245	214	138	67	54	52	24
CatScan	es	gsw	rm	eo	fa	la	pt	he	no	eu	frp	te	cy	ady	el	is	fi	ja	ro	gl	mk	sl	kk	sk	sq

Let's move the images

Because this category should frequently be scanned by the Wiktionary updater bots, I propose to move its few images into its parent category, Category:Phonology, and to rename it Category:Pronunciations according to Commons:Naming_categories#Grammatical_number. JackPotte (talk) 23:40, 24 February 2017 (UTC)

I'm not sure which problems this will solve. My bot scans these categories and ignores images; that is trivially easy to implement. The problems I deal with most are files that don't follow the naming rules of this category. Besides, all files put directly here (i.e., not in any subcategory) are useless for automatic processing by definition: they are not assigned to any language. So for me it's not important whether there are images among them. --Derbeth ^talk 08:04, 22 March 2017 (UTC)

Sorry if I gave the impression that this was about the programming difficulties of a hypothetical bot... I was actually talking about reducing the crawling time of a frequently run task, that is, about performance and ecology (several hours per year). JackPotte (talk) 08:41, 22 March 2017 (UTC)

I still don't see how removing a few images would matter for a category containing tens of thousands of images in its subcategories. Every bot scanning Category:Pronunciation should have a whitelist of extensions (ogg, oga, wav) and ignore other files. --Derbeth ^talk 06:11, 23 March 2017 (UTC)
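The extension whitelist suggested above can be sketched in a few lines of shell. This is only an illustration, assuming the bot already has a list of file titles, one per line; the function names is_pronunciation_file and filter_pronunciation_files are hypothetical and do not come from any existing bot.

```sh
#!/bin/sh
# Succeed if the file name ends in a whitelisted audio extension.
# The list (ogg, oga, wav) is the one named in the comment above;
# is_pronunciation_file is a hypothetical helper name.
is_pronunciation_file() {
  # lowercase the name so that ".OGG" also matches
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case $name in
    *.ogg|*.oga|*.wav) return 0 ;;
    *) return 1 ;;
  esac
}

# Keep only the audio files from a list of titles on stdin.
filter_pronunciation_files() {
  while IFS= read -r title
  do
    if is_pronunciation_file "$title"; then
      printf '%s\n' "$title"
    fi
  done
}

# Example (titles.txt is a placeholder):
#   filter_pronunciation_files < titles.txt
```

With such a filter in place, the few images in the category cost the bot one string comparison each, which is the point made in the comment.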
