Category talk:Pronunciation

From Wikimedia Commons, the free media repository

Template

Does any template exist for describing pronunciation files? See, for example, Template:English spoken article. -- Andrew Krizhanovsky (talk) 22:24, 23 January 2010 (UTC)

No, there is no need for something like this; pronunciation files have very simple descriptions. See File:En-uk-I can't.ogg and File:Pl-trzysta.ogg for how to write the description correctly. Please remember to name the file properly, including the language ISO code. --Derbeth talk 18:53, 25 January 2010 (UTC)
Ok. Thank you. -- Andrew Krizhanovsky (talk) 09:41, 26 January 2010 (UTC)
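
As a rough illustration of the naming convention mentioned above, a script could pull the language code out of a file name as sketched below. This is only a sketch: the function name lang_prefix is hypothetical, and it assumes that the code is the text before the first hyphen and is two or three letters long, as in the two example files. The actual naming rules of the category may be stricter.

```sh
#!/bin/sh
# Hypothetical helper: print the lower-cased language prefix of a
# pronunciation file name, or fail if the name has no such prefix.
# Assumption: the prefix is the text before the first hyphen and is a
# two- or three-letter code, as in "Pl-trzysta.ogg".
lang_prefix() {
  case "$1" in
    *-*) prefix=${1%%-*} ;;   # keep everything before the first hyphen
    *) return 1 ;;            # no hyphen, so no language prefix
  esac
  # Lower-case the prefix so "Pl" and "pl" are treated alike.
  prefix=$(printf '%s' "$prefix" | tr '[:upper:]' '[:lower:]')
  case "$prefix" in
    [a-z][a-z]|[a-z][a-z][a-z]) printf '%s\n' "$prefix" ;;
    *) return 1 ;;            # not a two- or three-letter code
  esac
}
```

For example, lang_prefix "Pl-trzysta.ogg" prints pl, while a name without a hyphen makes the function return a non-zero status.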

Simple how-to

Here's how to record a batch of words on an Ubuntu Linux platform. Use 'synaptic' or 'apt-get' to install the necessary software packages, such as 'alsa-utils' and 'sox'. Compile a list of the words that you want to record, one word per line, and feed it to the script below on standard input.

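#!/bin/sh
# Language code, used as the filename prefix.
lang=sv
# Read the word list from standard input, one word per line.
while IFS= read -r word
do
  # Show the word as a prompt.
  echo "$word"
  # Record 4 seconds at a sample rate of 100000 Hz.
  arecord -r 100000 -d 4 "$lang-$word.wav"
  # Normalize, trim silence at both ends (keeping .25 s), write Ogg Vorbis.
  sox "$lang-$word.wav" "$lang-$word.ogg" norm vad -p .25 reverse vad -p .25 reverse
done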

Here, "sv" (Swedish) is set as the language, which will be used as a filename prefix. "echo" prints the word as a prompt. "arecord" records four seconds of audio at a sample rate of 100,000 Hz (the -r option sets the sample rate in Hz, not a bit rate), including any initial and trailing silence, so you don't have to press any key to start and stop the recording. "sox" then converts the recorded wav file to the free and open Ogg Vorbis format, after first normalizing the sound level and trimming the initial and trailing silence while keeping a .25-second margin of silence. --LA2 (talk) 01:28, 15 March 2013 (UTC)

Can it cut a single recording into many words? Infovarius (talk) 13:57, 22 March 2017 (UTC)
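
The script above does not do this, but sox itself can split on pauses. The following is a minimal sketch, not a tested recipe for real recordings: the function name split_recording is hypothetical, and the thresholds (0.1 s of sound above 1% to start a word, 0.5 s below 1% to end it) are assumptions that would need tuning for a given microphone and speaker.

```sh
#!/bin/sh
# Hypothetical helper: split one long recording into one file per word,
# cutting wherever sox detects a pause.
# Usage: split_recording INPUT OUTPUT
# sox numbers the outputs, e.g. words.wav -> words001.wav, words002.wav, ...
split_recording() {
  # "silence 1 0.1 1%" skips leading silence until 0.1 s of sound above 1%.
  # "1 0.5 1%" ends the current output after 0.5 s below 1%.
  # ": newfile : restart" opens the next output file and repeats the chain.
  sox "$1" "$2" silence 1 0.1 1% 1 0.5 1% : newfile : restart
}
```

For example, split_recording sv-all.wav sv-word.wav would write sv-word001.wav, sv-word002.wav and so on. The pieces still have to be renamed by hand or by script to match the word list, and the last piece may be empty if the recording ends in silence.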

Statistics

The language categories with the most files (counting only files placed directly in the category, not in subcategories, and listing only categories with at least 20 files) are:

On September 16, 2015: Dutch (334705), Polish (22423), Ukrainian (16178), German (15048), French (13535), Belarusian (8637), Tamil (8086), Russian (6345), Chinese (4932), Swedish (4344), Hungarian (3623), Czech (3075), Latvian (1944), Jèrriais (1769), Arabic (1519), Armenian (1259), Italian (1075), Latin (655), Farsi (492), Navajo (454), Malagasy (443), Telugu (352), Norwegian (325), Spanish (319), Adyghe (288), Portuguese (264), Esperanto (260), Finnish (257), Welsh (219), Vietnamese (200), Lithuanian (165), Georgian (155), Icelandic (152), Galician (145), English (143), Danish (129), Tagalog (128), Turkish (127), Hebrew (123), Romanian (100), Greek (86), Odia (82), Kölsch (79), Macedonian (78), Thai (76), Nepali (72), Hindi (63), Croatian (59), Bashkir (56), Slovak (51), Irish (51), Devanagari (51), Korean (47), Mbunda (40), Bulgarian (30), Catalan (26), Sanskrit (23), Twi (22). --LA2 (talk) 20:06, 16 September 2015 (UTC)

Why not in subcategories? Infovarius (talk) 13:56, 22 March 2017 (UTC)
Right, subcategories should be included. But just for comparison, here is an updated count of the top-level files, with remarkable improvements in boldface. The table below shows a summary for 5 levels of subcategories. --LA2 (talk) 11:22, 19 May 2017 (UTC)

On May 19, 2017: Dutch (436804), Polish (23515), Russian (17354), Ukrainian (16182), Belarusian (8634), Chinese (4991), Armenian (4546), Swedish (4370), Hungarian (3701), Czech (3078), French (2933), Luxembourgish (2920), Odia (1973), Latvian (1946), Jèrriais (1769), Arabic (1537), Italian (1123), Latin (651), Wolof (586), Hebrew (577), Persian (558), English (543), Esperanto (483), Navajo (455), Malagasy (443), Telugu (386), Spanish (330), Adyghe (327), Norwegian (326), Upper Sorbian (312), Portuguese (294), Finnish (264), Welsh (228), Vietnamese (201), Lithuanian (166), Georgian (157), Galician (155), Icelandic (149), Turkish (134), Danish (132), Tagalog (128), Bengali (116), Romanian (104), Bashkir (90), Thai (89), Greek (81), Kölsch (79), Macedonian (76), Korean (74), Nepali (73), Hindi (67), Croatian (59), Bulgarian (53), Devanagari (51), Slovak (50), Pronunciation of Kannada alphabet (49), Irish (47), Oromo (42), Mbunda (41), Twi (40), Voice spectrograms (31), Catalan (30), Sanskrit (24), Albanian (23), Limburgish (22), Ancient Greek (22).

Language       CatScan  May 19, 2017  August 8, 2017
Dutch          nl       439887        445308
German         de       53605         62899
English        en       24848         25102
Polish         pl       23948         23946
Russian        ru       18032         19055
French         fr       17608         17616
Ukrainian      uk       16190         16192
Tamil          ta       8719          8726
Belarusian     be       8639          8639
Chinese        zh       5967          5968
Armenian       hy       4634          4636
Swedish        sv       4605          4585
Czech          cs       3885          3880
Hungarian      hu       3722          3718
Luxembourgish  lb       2921          3483
Jèrriais       nrf      2310          2310
Serbian        sr       2147          2147
Odia           or       2048          2273
Latvian        lv       2039          2039
Arabic         ar       1695          1706
Italian        it       1393          1634

Languages with less than 1000 words

Language       CatScan  May 19, 2017  August 8, 2017
Spanish        es       971           968
Alemannic      gsw      828           882
Romansh        rm       823           823
Esperanto      eo       761           762
Persian        fa       726           953
Latin          la       719           719
Portuguese     pt       654           652
Hebrew         he       577           582
Norwegian      no       517           564
Basque         eu       -             511
Arpitan        frp      420           420
Telugu         te       401           401
Welsh          cy       366           366
Adyghe         ady      347           347
Greek          el       276           276
Icelandic      is       275           274
Finnish        fi       269           269
Japanese       ja       252           253
Romanian       ro       245           245
Galician       gl       214           214
Macedonian     mk       136           138
Slovene        sl       70            67
Kazakh         kk       54            54
Slovak         sk       51            52
Albanian       sq       24            24

Let's move the images

Because this category is likely to be scanned frequently by the Wiktionary updater bots, I propose to move its few images into its parent category, Category:Phonology, and to rename it Category:Pronunciations according to Commons:Naming_categories#Grammatical_number. JackPotte (talk) 23:40, 24 February 2017 (UTC)

I'm not sure which problems this will solve. My bot scans these categories and ignores images; that is trivially easy to implement. The problems I deal with most often are files that don't follow the naming rules of this category. Besides, all files put directly here (i.e., not in any subcategory) are useless for automatic processing by definition, since they are not assigned to any language. So for me it's not important whether there are images among them. --Derbeth talk 08:04, 22 March 2017 (UTC)
Sorry if I gave the impression that this was about the programming difficulties of a hypothetical bot. I was actually talking about optimizing the crawl time of a frequently run task, that is, about performance and ecology (several hours per year). JackPotte (talk) 08:41, 22 March 2017 (UTC)
I still don't see how removing a few images would matter for a category containing tens of thousands of images in its subcategories. Every bot scanning Category:Pronunciation should have a whitelist of extensions (ogg, oga, wav) and ignore other files. --Derbeth talk 06:11, 23 March 2017 (UTC)
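
The extension whitelist described above could look roughly like the sketch below. It is not taken from any existing bot: the function names is_audio_file and filter_audio_files are hypothetical, and treating upper-case extensions such as .OGG as audio is an added assumption.

```sh
#!/bin/sh
# Hypothetical helper: succeed only for names with a whitelisted
# audio extension (ogg, oga, wav).
is_audio_file() {
  # Lower-case the name so ".OGG" and ".ogg" are treated alike (assumption).
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.ogg|*.oga|*.wav) return 0 ;;
    *) return 1 ;;
  esac
}

# Read file names from standard input, one per line, and print only
# those that pass the whitelist.
filter_audio_files() {
  while IFS= read -r file
  do
    if is_audio_file "$file"; then
      printf '%s\n' "$file"
    fi
  done
}
```

Fed a category listing with one file name per line, filter_audio_files passes through names such as Pl-trzysta.ogg and drops images such as a .png or .svg file, so images in the category cost the bot one string comparison each.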