Commons:Bots/Requests/Smallbot 9

Smallbot (talk · contribs)

Operator: Smallman12q (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: To fulfill Commons:Batch_uploading#VOA_pronunciation_sound_files. Uploading ~6500 pronunciation files from http://names.voa.gov

Automatic or manually assisted: Automatic

Edit type (e.g. Continuous, daily, one time run): One initial run, followed by monthly runs

Maximum edit rate (e.g. edits per minute): 10-15, as fast as it uploads

Bot flag requested: (Y/N): No

Programming language(s): Python 3.2 with requests and beautifulsoup4. The script calls avconv (the Libav counterpart of ffmpeg) for conversion.

Source
#!/usr/bin/env python3.2
# -*- coding: utf-8 -*-

#For uploading files from names.voa.gov to commons

from bs4 import BeautifulSoup
import requests
from subprocess import call
import os.path
import traceback
from PyRWiki import Wiki #Requests based wrapper for api
from p import p

DEBUG=False

#http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents
#http://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
#http://stackoverflow.com/questions/2077897/substitute-multiple-whitespace-with-single-whitespace-in-python
def textOf(soup):
    return ' '.join(''.join(soup.findAll(text=True)).replace('\xa0', ' ').strip().split())

#Capitalize the first letter of each word (a word starts after a space or "-") and lowercase the rest; assumes single separators only
def fixname(oldname):
    oldname=oldname.strip()
    fixed=""
    lastwasspace=True#Make first letter upper
    for i in oldname:
        if lastwasspace:
            fixed += i.upper()
            lastwasspace=False
        else:
            if i == " " or i == "-":
                lastwasspace = True
            else:
                lastwasspace = False
            fixed += i.lower()
    return fixed

def log(stuff):
    print(stuff)

#Log in
commons = Wiki("https://commons.wikimedia.org/w/api.php","Smallbot")
commons.login('Smallbot',p.bP)
commons.setEditToken()

counter= 2 #starts at 2
log('Checking for last id.')
if os.path.isfile('last.txt'):
    with open('last.txt', 'r') as content_file:
        counter = int(content_file.read())
        log('Last id found: ' + str(counter))
else:
    log('No prior id found. Starting at 2.')

session=requests.session()
if DEBUG:
    session.proxies = {'http': 'http://localhost:8888'}
session.headers = {'Referer': 'https://commons.wikimedia.org/wiki/Commons:Batch_uploading/VOA_pronunciation_sound_files'}

lastsuccess=counter
reached404=0
try:
    while reached404 < 25: # stop after 25 consecutive ids with no entry
        r = session.get('http://names.voa.gov/modal.phrasedetail.php?id=' + str(counter))

        #if r.status_code == 404:
        if "Cannot find the requested name" in r.text:
            reached404 += 1
            log('404 reached for ' + str(counter))
        else:
            reached404=0#reset 404 counter
            lastsuccess=counter
            soup=BeautifulSoup(r.content, 'html.parser') # name the parser explicitly for consistent results
            soupbody=soup.select('div.modal-body')[0]
            if textOf(soupbody) != "How do you say ?":
                name=textOf(soupbody.select("h2")[0])[15:-1] # remove "How do you say" and '?'
                name=fixname(name)
                pronounce= textOf(soupbody.select('p')[0])
                region=textOf(soup.select('h4')[0].findNext('p')) # first <p> after the first <h4>
                if textOf(soup.select('h4')[0]) != 'Region':
                    region=''

                r=session.get('http://names.voa.gov/sounds/' + str(counter) + '.mp3')
                r.raise_for_status() #should be no errors

                log('---------------------------------')
                log('ID: ' + str(counter))
                log('Name: ' + name)
                log('Pronounce: ' + repr(pronounce))
                log('Region: ' + region)
                log(str(len(r.content)) + ' bytes')

                with open('data.mp3','wb') as voamp3:
                    voamp3.write(r.content)

                filedesc="{{Information\n" +\
                            "|description= {{VOA pronunciation|term=" + name + "|region=" + region + "|transliteration=" +  pronounce + "}}\n" +\
                            "|date= 2013\n" +\
                            "|source= VOA pronunciation guide: [http://names.voa.gov/modal.phrasedetail.php?id="  + str(counter) + " " + name + "]\n" +\
                            "|author= Jim Tedder\n" +\
                            "|permission= {{PD-USGov-VOA}}\n" +\
                            "|other_versions=\n" +\
                            "}}\n"
                if os.path.exists('data.ogg'):
                    os.remove('data.ogg')
                #call(['avconv', '-i', 'data.mp3', '-acodec', 'libvorbis', '-aq', '7', 'data.webm'])
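                #use .ogg instead of .webm; stop if the conversion fails rather than uploading a missing file
                if call(['avconv', '-i', 'data.mp3', '-acodec', 'libvorbis', '-aq', '7', 'data.ogg']) != 0:
                    raise RuntimeError('avconv failed for id ' + str(counter))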

                if region != '':
                    region = ' from ' + region
                commons.upload(title="En-us-" + name + region + ' pronunciation (Voice of America).ogg',
                               filelocation='data.ogg',
                               text=filedesc,
                               comment='[[Commons:Bots/Requests/Smallbot 9]]: Uploading Voice of America pronunciation files from http://names.voa.gov',
                               uploadifduplicate=False)
                #TODO-upload data.webm as file
            else:
                log('Empty at ' + str(counter))
        counter += 1

except:
    traceback.print_exc()
finally: # always save the last id that returned a page, so the next run resumes from it
    with open('last.txt','w') as lastfp:
        lastfp.write(str(lastsuccess))
    log('Done.')

The script also needs a 'last.txt' file containing the value 6937, the id from which the next run resumes.
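
The resume logic in the script above can be isolated into two small helpers. This is only a sketch: the function names `read_last_id` and `write_last_id` are hypothetical and do not appear in the bot's source; the file name 'last.txt' and the default start id of 2 are taken from the script.

```python
import os

def read_last_id(path='last.txt', default=2):
    """Return the id stored in the checkpoint file, or the default if the file is absent."""
    if not os.path.isfile(path):
        return default
    with open(path, 'r') as checkpoint:
        return int(checkpoint.read().strip())

def write_last_id(last_id, path='last.txt'):
    """Overwrite the checkpoint file with the given id."""
    with open(path, 'w') as checkpoint:
        checkpoint.write(str(last_id))
```

Calling `write_last_id(6937)` once before a monthly run would create the file described above, and `read_last_id()` would then return that id as the starting counter.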

Smallman12q (talk) 20:45, 2 May 2013 (UTC)

Discussion

What should the file description be? Should I use {{Pronunciation}}? Smallman12q (talk) 20:45, 2 May 2013 (UTC)

It looks like this template is not popular, but it's a good idea to standardize descriptions for a class of media files. BTW, is this source so unique that Commons doesn't already have such pronunciations? :-) --EugeneZelenko (talk) 14:39, 3 May 2013 (UTC)
I don't believe Commons has these pronunciations. Is there some standard pronunciation template? I'll probably make one for the VOA files. Smallman12q (talk) 03:09, 4 May 2013 (UTC)
The template would read:

Voice of America pronunciation of <term> from the region of <region>. Transliteration: <transliteration>

Is that fine? It'll also auto-categorize by region and by the first letter of the first name, so "AL-HALQI, WAEL" would become "WAEL AL-HALQI" and be categorized under W. Is the letter/region categorization needed? Smallman12q (talk) 19:51, 4 May 2013 (UTC)
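
The reordering and letter categorization described in this comment could look roughly like the sketch below. The helper names `reorder_name` and `category_letter` are hypothetical, and the template's real logic is not shown on this page; the sketch only assumes that names arrive as "LAST, FIRST" and that names without a comma are left as they are.

```python
def reorder_name(raw):
    """Turn 'LAST, FIRST' into 'FIRST LAST'; names without a comma are only stripped."""
    last, comma, first = raw.partition(',')
    if not comma:
        return raw.strip()
    return (first.strip() + ' ' + last.strip()).strip()

def category_letter(raw):
    """Return the uppercase first letter of the reordered name, or '' for an empty name."""
    name = reorder_name(raw)
    return name[0].upper() if name else ''
```

With these definitions, `reorder_name("AL-HALQI, WAEL")` gives "WAEL AL-HALQI" and `category_letter("AL-HALQI, WAEL")` gives "W", matching the example in the comment.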

Well... this should clearly be marked as an American pronunciation recommendation. At least for the few German names I have checked, this is certainly not the gold standard for pronunciation (Erik Honnecker, Frantz Muntefering, and many more). --Dschwen (talk) 16:49, 3 May 2013 (UTC)

There is contact info; I'm sure they would be open to your feedback [1] (or refer them to our local Goethe Institute). The value is that it is a currently maintained public-domain source of pronunciations. Slowking4 †@1₭ 13:03, 4 May 2013 (UTC)

I've uploaded a few to Category:Terms from Voice of America pronunciation guide. Is it good to go? Smallman12q (talk) 14:03, 8 May 2013 (UTC)

It'd be a good idea to include these files in some pronunciation categories. --EugeneZelenko (talk) 14:27, 8 May 2013 (UTC)
I could add them to Category:English pronunciation and also prepend the names with En-us so it'd be "File:En-us Abadilla from Philippines pronunciation (Voice of America).webm"? Would that be all? Smallman12q (talk) 23:17, 8 May 2013 (UTC)
Adding a language code prefix is definitely a good idea. BTW, why not upload in Ogg format? The majority of pronunciations, at least, use this format. --EugeneZelenko (talk) 14:37, 9 May 2013 (UTC)
I've asked at w:Wikipedia:Village_pump_(technical)#Preferred_format_for_pronunciations whether it should be .webm or .ogg. Is there a reason you prefer one over the other? I can do either; it's only a one-line change. Smallman12q (talk) 17:56, 9 May 2013 (UTC)
The bot is now uploading all files as .ogg. Could you delete:
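
To illustrate why switching containers is a one-line change, the sketch below builds the conversion command for either format. The function name `build_convert_command` is hypothetical; the avconv arguments are copied from the two `call(...)` lines in the source above, where only the output file name differs.

```python
def build_convert_command(source, container):
    """Return the avconv argument list that converts `source` to data.<container>."""
    if container not in ('ogg', 'webm'):
        raise ValueError('unsupported container: ' + container)
    # Same Vorbis audio settings for both containers; only the output extension changes.
    return ['avconv', '-i', source, '-acodec', 'libvorbis', '-aq', '7', 'data.' + container]
```

The returned list can be passed to `subprocess.call` as in the bot's source, for example `call(build_convert_command('data.mp3', 'ogg'))`.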
  • File:Egil Aarvik from Norway pronunciation (Voice of America).webm
  • File:Sani Abacha from Nigeria pronunciation (Voice of America).webm
  • File:Jorge Abadia from Panama pronunciation (Voice of America).webm
  • File:Abadilla from Philippines pronunciation (Voice of America).webm
  • File:Leonid Abalkin from Russia pronunciation (Voice of America).webm
  • File:Domingo Iturbe Abasolo from Spain pronunciation (Voice of America).webm

Smallman12q (talk) 00:36, 10 May 2013 (UTC)

You could just add {{Superseded}} or {{Delete}} to the files.

If there are no other objections, I think the task should be approved. --EugeneZelenko (talk) 14:31, 10 May 2013 (UTC)

The initial run is done. The bot will run monthly or so in the future. Smallman12q (talk) 23:20, 10 May 2013 (UTC)