User:Faebot/guide

From Wikimedia Commons, the free media repository

I am cobbling together some examples here from Faebot's code that may help other Python-based bot designers. This is to support a presentation for GLAMwiki in 2013, but may turn into a more general guide. -- (talk) 10:09, 4 February 2013 (UTC)

Key points to plan for when designing a bot

Before writing any code:

  • Check it is in scope and does not duplicate something else.
  • Check there is not an easier way of achieving the outcome.
  • Establish a consensus for it.
  • Start with one example.
  • Confirm copyright status of content.
  • Try to avoid starting a war.

Features of any code:

  • If an uploading bot, always check for duplicate files and make good use of templates for preserving context.
  • Take an Agile approach and test as you go along.
  • Ready? Test using a dry run and first batch run.
  • Made a mess? You fix it because nobody else wants to.

Tricks and tips

Command line parameters

It is a basic technique, but terribly useful if you want to use the same script to import different sub-pages of a complex website, or rerun a batch task from a certain position. In Python, command line parameters are passed as a list in sys.argv. The first item, sys.argv[0], holds the script name, so the first user-supplied parameter is sys.argv[1]. Treat it as any other list.

import sys
...
skip = 0
if len(sys.argv) > 1:
    # If there is a first parameter, use it as a number (float() also accepts "50.0")
    skip = int(float(sys.argv[1]))
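If a script grows more than one or two options, the standard library's argparse module handles parsing, defaults and help text for you; a minimal sketch (the --skip option name is my own, for illustration, not something Faebot uses):

```python
import argparse

parser = argparse.ArgumentParser(description="Batch bot task")
# Optional parameter with a default, so a plain run still works
parser.add_argument("--skip", type=int, default=0,
                    help="number of items to skip before resuming")
args = parser.parse_args(["--skip", "50"])  # omit the list to read sys.argv
print(args.skip)
```

Running the script with no arguments leaves args.skip at 0, and `--help` is generated automatically.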

How to spoof your user agent

You may want to spoof your user agent when reading URLs, particularly for sites that are likely to track such things, rather than using Python's default ("Python-urllib/2.6"). In the standard manual this is covered by urllib2.Request[1].

import urllib2
...
headers = { 'User-Agent' : 'Mozilla/5.0' } # Spoof header
webpage = "http://example.com" # Web page to read
req = urllib2.Request(webpage, None, headers)
html = urllib2.urlopen(req).read()
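In Python 3, urllib2 was folded into urllib.request; the same trick looks like this (a sketch, with the actual fetch commented out so nothing touches the network):

```python
from urllib.request import Request, urlopen

headers = {'User-Agent': 'Mozilla/5.0'}  # Spoof header
webpage = 'http://example.com'           # Web page to read
req = Request(webpage, None, headers)
# html = urlopen(req).read()  # uncomment to actually fetch the page
```

Note that urllib.request normalizes header names internally, so the stored key becomes 'User-agent'.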

Colour hack for terminal window on a mac

How Faebot looks in a mac terminal.
#  Colours only on mac (ANSI escape codes)
Red="\033[0;31m"     # Red
Green="\033[0;32m"   # Green
GreenB="\033[1;32m"  # Green bold
GreenU="\033[4;32m"  # Green underlined
Yellow="\033[0;33m"  # Yellow
Blue="\033[0;34m"    # Blue
Purple="\033[0;35m"  # Purple
Cyan="\033[0;36m"    # Cyan
White="\033[0;37m"   # White

As I sometimes run my scripts from Windoze on a laptop, I often watch for the last parameter in argv to see if "w" has been added to the command line; if so, I set these to empty strings.
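That switch can be sketched as a small helper (a hypothetical function of my own, not Faebot's actual code, showing three of the colours):

```python
import sys

def colours(argv=None):
    """Return (Red, Green, Yellow) escapes, blanked if the last arg is 'w'."""
    argv = sys.argv if argv is None else argv
    Red, Green, Yellow = "\033[0;31m", "\033[0;32m", "\033[0;33m"
    if argv and argv[-1] == "w":   # "w" = Windows console, no ANSI colours
        Red = Green = Yellow = ""
    return Red, Green, Yellow
```

Usage: `Red, Green, Yellow = colours()` at the top of the script, and every later print works unchanged on either platform.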

Update: I have moved over to using the Python colorama module, a much smarter solution, so long as you can install a module. See https://pypi.python.org/pypi/colorama

How to check Tineye for the usage of an image elsewhere on the internet

Tineye has an API, but it is not a free service. These functions scrape the number of matches from the Tineye website for any given image title; it would be easy to extend this to return a list of top matches (where there are any). Based on the Tineye website terms, this may be limited to 50 images per day and 150 per week. Google image searches have no such limit that I am aware of, and so may be a better alternative.

  1. These functions rely on importing BeautifulSoup[2] and urllib2.
  2. xmltry() and xmlreadtry() are functions which defensively open the URI and read from it; they could be replaced with standard calls to urllib2's open and read functions. "Defensively" means they have a go, then sleep and try again at increasingly longer intervals, in case my internet connection is dropping out or the source website is not cooperating.
  3. Red, Blue and Yellow are global colour variables as per the earlier tip.
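xmltry() and xmlreadtry() themselves are not listed here, but the defensive retry idea behind them can be sketched like this (a hypothetical stand-in, not Faebot's actual code):

```python
import time

def defensive(call, attempts=5, delay=1):
    """Try call() repeatedly, sleeping for increasingly long intervals."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise                          # give up after the last attempt
            time.sleep(delay * (attempt + 1))  # back off: 1s, 2s, 3s...
```

Usage would be something like `u = defensive(lambda: urllib2.urlopen(uri))`, so a flaky connection or an uncooperative server only aborts the run after several tries.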
def getThumb(image):
    api="http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url&iiurlwidth=300&format=xml&titles="
    # 300 px is arbitrary but used in the Tineye Commons gadget
    uri=api+urllib2.quote(image)
    u=xmltry(uri)
    x=xmlreadtry(u,uri)
    try:
        return BeautifulSoup(x).find('ii')['thumburl']
    except Exception:  # no imageinfo/thumburl in the API response
        return ''

def getTineye(i):  # return the number of matches as an integer, plus the search URL
    image=getThumb(i)
    if image=='': return 0, ''
    tin="http://tineye.com/search?&sort=size&order=desc&url="+urllib2.quote(image)
    url=xmltry(tin)
    html=xmlreadtry(url,tin)
    try:
        return int(float(BeautifulSoup(html).find('div',{'class':'search-content-results-header-details'}).find('h2').find('span').contents[0])), tin
    except Exception:  # no match count found on the page; assume zero
        return 0, tin

def pTineye(image, txt):
    n,u=getTineye(image)
    if n==0:
        return '',txt
    print Red+"Tineye has",n,"matches",Blue+u,Yellow
    if n>1:
        txt+='\n[[Category:Mobile uploads lacking EXIF data and with multiple Tineye matches]]'
    return 'WARNING: Tineye found '+str(n)+' matching images. ',txt