Commons:IA audio

From Wikimedia Commons, the free media repository

Shortcut: Commons:IA audio

Introduction

This is a spin-off batch upload project built on the code of Commons:IA books, with the aim of uploading freely licensed podcasts from the Internet Archive (IA).

Uploads can be listed with intitle:"(IA " incategory:"Sound files uploaded by Fæ" insource:Podcast.

Implementation

The batch upload runs on Pywikibot in Python. It uses the internetarchive package to run an IA search for "subject:Podcast <topic>" and reads item.metadata to retrieve whatever metadata fields are available. If the item has no file in VBR MP3 format, it is skipped.
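
The format filter can be sketched as below. This is a minimal illustration, not the project's actual code: the function name vbr_mp3_names is hypothetical, and it assumes each file record is a dict with "name" and "format" keys, which is how the internetarchive package is understood to expose item.files.

```python
def vbr_mp3_names(files):
    """Return the names of files whose IA format is exactly 'VBR MP3'.

    `files` is assumed to be a list of dicts with 'name' and 'format'
    keys (the shape of item.files in the internetarchive package).
    """
    return [f["name"] for f in files if f.get("format") == "VBR MP3"]


# Hypothetical usage against the live archive (not executed here):
#
#   from internetarchive import search_items, get_item
#   for result in search_items("subject:Podcast history"):
#       item = get_item(result["identifier"])
#       names = vbr_mp3_names(item.files)
#       if not names:
#           continue  # no VBR MP3 file, so the item is skipped
```

An empty result corresponds to the "item is skipped" case described above.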

Where multiple MP3 files are available under the same item and match the 'VBR MP3' format, they are presumed to be sequential editions of the podcast. The first file is uploaded with the default name based on the IA title, and later files receive sequential numbers. For example, WandaWisdomLuckyBitchRadioPioneerDays has two episodes, identified on Commons as Lucky Bitch Radio- Pioneer Days! (IA WandaWisdomLuckyBitchRadioPioneerDays).mp3 and Lucky Bitch Radio- Pioneer Days! (IA WandaWisdomLuckyBitchRadioPioneerDays-2).mp3.
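
The naming pattern in this example can be reproduced with a short sketch. The function name commons_filenames is hypothetical, and the title is assumed to have already been cleaned of characters that Commons does not allow in file names; how the original script does that cleaning is not documented here.

```python
def commons_filenames(title, identifier, count):
    """Build Commons file names for `count` MP3 files of one IA item.

    The first file carries the plain IA identifier; later files get a
    sequential suffix starting at -2, following the example above.
    """
    names = []
    for n in range(1, count + 1):
        suffix = "" if n == 1 else "-%d" % n
        names.append("%s (IA %s%s).mp3" % (title, identifier, suffix))
    return names
```

Calling commons_filenames("Lucky Bitch Radio- Pioneer Days!", "WandaWisdomLuckyBitchRadioPioneerDays", 2) yields the two file names quoted above.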

Copyright

A wide variety of licenses is used on IA. Where no licenseurl is found in the IA metadata, the item is skipped as unverifiable. The following mapping of licenses to Commons templates has been built up from examples as they arise:

# Creative Commons license path fragment -> Commons license template
lics = {
    'by/3.0':       "Cc-by-3.0",
    'by/3.0/us':    "Cc-by-3.0/us",
    'by/2.5':       "Cc-by-2.5",
    'by/4.0':       "Cc-by-4.0",
    'by-sa/2.5':    "Cc-by-sa-2.5",
    'by-sa/2.5/es': "Cc-by-sa-2.5-es",
    'by-sa/3.0':    "CC-BY-SA-3.0",
    'by-sa/3.0/us': "CC-BY-SA-3.0-US",
    'by-sa/4.0':    "CC-BY-SA-4.0",
    'mark/1.0':     "PDMark-owner",
    'zero/1.0':     "Cc-zero",
}

{{PDMark-owner}} was a new template at the start of this upload; however, based on the proposal and discussions, the principle of accepting PDM releases by the creator/artist is not thought controversial.
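
One way the mapping might be applied to a licenseurl value is sketched below. How the original script derives the lookup key is not stated, so the URL parsing here is an assumption, and the names LICS and commons_template are hypothetical. The dictionary is a verbatim copy of the mapping above so that the example runs on its own.

```python
from urllib.parse import urlparse

# Copy of the mapping listed above, repeated so the sketch is self-contained.
LICS = {
    'by/3.0': "Cc-by-3.0",
    'by/3.0/us': "Cc-by-3.0/us",
    'by/2.5': "Cc-by-2.5",
    'by/4.0': "Cc-by-4.0",
    'by-sa/2.5': "Cc-by-sa-2.5",
    'by-sa/2.5/es': "Cc-by-sa-2.5-es",
    'by-sa/3.0': "CC-BY-SA-3.0",
    'by-sa/3.0/us': "CC-BY-SA-3.0-US",
    'by-sa/4.0': "CC-BY-SA-4.0",
    'mark/1.0': "PDMark-owner",
    'zero/1.0': "Cc-zero",
}


def commons_template(licenseurl):
    """Return the Commons template name for an IA licenseurl, or None.

    None means the item should be skipped: the URL is missing, is not a
    Creative Commons URL, or names a license that is not in the mapping.
    """
    if not licenseurl:
        return None
    parsed = urlparse(licenseurl)
    if not parsed.netloc.lower().endswith("creativecommons.org"):
        return None
    parts = parsed.path.strip("/").split("/")
    # Assumed URL shapes: /licenses/by/3.0/us/ and /publicdomain/zero/1.0/
    if len(parts) < 2 or parts[0] not in ("licenses", "publicdomain"):
        return None
    return LICS.get("/".join(parts[1:]))
```

Returning None for anything unrecognised keeps the behaviour conservative: licenses such as non-commercial or no-derivatives variants simply fall through and are not uploaded.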

Mixed copyright

Though there is a single copyright URL in the IA metadata, it relies on volunteers working out how to write up the release. Many of the podcasts rely on background music or music samples, which may include brief segments of film soundtracks. This is legitimate usage for the original podcasts, as samples will come under fair use provisions; however, Commons' interpretation of Licensing policy leaves no room for fair use. Where the podcast host is talking over the sampled music or soundtrack, there may be an argument that the sample is sufficiently de minimis to be compliant. Where this argument fails, a reasonable solution is to clip out, or blank out, the problematic segments in the recording and leave comments on the file page to alert re-users.

A useful free tool for cropping copyrighted segments out of a podcast is the visual editor Audacity: download the file locally and edit it there. Problematic sections can be blanked to keep the timings, interview sections can be extracted and uploaded separately, or the music can be cropped out completely, leaving only the sections with known suitable copyright. In the example of Substral PolyWohnzimmer, there were multiple music segments with no-derivatives and non-commercial restrictions, even though the spoken word sections were released as CC-BY. The file has now been overwritten with an audio file limited to the spoken content, which is about half the length.

Edited and overwritten audio files can be added to Category:Candidates for revision deletion to ensure the version with copyright issues against Licensing policy is removed.

Layout

Commons has no good or obvious template for audio podcast files, so the standard {{Information}} box is used, with its "other fields" parameter carrying useful extra metadata. Example:

|other fields = 
 {{information field|name=Title|value=A History Of: Hannibal. Episode 26 - In the Bleak Midwinter}}
 {{information field|name=Publicdate|value=2012-12-16 18:33:22}}
 {{information field|name=Subject|value=A; History; Of; Hannibal; Episode; 26; Twentysix; In the Bleak Midwinter; Jamie Redfern; Hannie Kirkham; Rome; Carthage; Punic Wars; TheHistoryOf Podcast; Sempronius; Scipio; Victumulae; Po; Placentia; Apennines}}
 {{information field|name=Collection|value=podcasts}}
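
A block like the one above could be generated from the IA metadata along the following lines. This is a sketch under assumptions: the function name other_fields and the chosen key list are hypothetical, multi-valued fields are assumed to arrive as lists and are joined with "; " as in the Subject line above, and no escaping of wikitext characters such as "|" is attempted.

```python
def other_fields(metadata, keys=("title", "publicdate", "subject", "collection")):
    """Render selected IA metadata as an {{Information}} 'other fields' block."""
    lines = ["|other fields = "]
    for key in keys:
        value = metadata.get(key)
        if not value:
            continue  # field absent from this item's metadata
        if isinstance(value, (list, tuple)):
            value = "; ".join(str(v) for v in value)
        lines.append(
            " {{information field|name=%s|value=%s}}" % (key.capitalize(), value)
        )
    return "\n".join(lines)
```

Fields missing from an item are simply omitted, which matches the approach of using whatever metadata is available.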

Programming daemonics

Some larger mp3 files appear to suffer the same issue seen with the PDF batch uploads, where the WMF API never returns a successful upload message, so the same time-boxing technique has been applied. A single upload thread is spawned and file existence is checked every 15 seconds; the thread is terminated as soon as the file is found to be uploaded, or is terminated as unsuccessful after 5 minutes. Refer to Phab:T254459, Large PDF upload issue.
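
The time-boxing idea can be sketched as follows. This is an illustration rather than the project's code: timeboxed_upload, upload and exists are hypothetical names, with upload standing in for the Pywikibot upload call and exists for a check that the file page is now on Commons. Python cannot forcibly kill a thread, so this sketch runs the upload in a daemon thread and abandons it at the deadline instead of terminating it.

```python
import threading
import time


def timeboxed_upload(upload, exists, poll_seconds=15, timeout_seconds=300):
    """Run `upload` in a background thread and poll `exists` until a deadline.

    Returns True as soon as `exists()` reports the file, and False if the
    upload finishes without the file appearing or the time box expires.
    """
    worker = threading.Thread(target=upload, daemon=True)
    worker.start()
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        worker.join(poll_seconds)  # wait at most one polling interval
        if exists():
            return True  # file is on the wiki, even if the API never replied
        if not worker.is_alive():
            return False  # upload call ended but no file was found
    return exists()  # final check once the time box has expired
```

The defaults mirror the 15 second polling interval and 5 minute limit described above; a caller would pass two small functions wrapping the actual upload and existence check.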

Known issues

Unexpected mimetypes
Some files have a detected mimetype that does not match the expected mp3 audio format. For example, 12ZombiesAteMySonicExeSenseiPongcast (with an mp3 size of 120 MB) is detected by the Commons API as an executable file. This may be either an error at the IA end, with the wrong file format stored, or an issue with the WMF server tool that checks mimetypes.
Non-unique
Multiple IA identifiers may be allocated to the same object by accident on IA. During upload, duplicates will be detected by the API and rejected. For example, ALittleDeadComicBookUpdatefor23_27 = ALDPEP19-23OCT2009 = ALittleDeadComicBookUpdatefor23_958.
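
For the unexpected mimetype issue, a quick local look at the first bytes of a downloaded file can help tell whether the file itself is suspect. The sketch below is a hypothetical helper (sniff_header is not part of the upload code) and checks only three well-known signatures: an ID3v2 tag, an MPEG audio frame sync, and the "MZ" marker of Windows executables. The server-side detection may use different and more elaborate rules.

```python
def sniff_header(head):
    """Classify a file by its leading bytes (a rough local check only)."""
    if head[:3] == b"ID3":
        return "mp3-id3"  # MP3 starting with an ID3v2 tag
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mpeg-frame"  # 11-bit MPEG audio frame sync
    if head[:2] == b"MZ":
        return "windows-exe"  # DOS/Windows executable signature
    return "unknown"


# Hypothetical usage on a locally downloaded file:
#
#   with open("episode.mp3", "rb") as fh:
#       print(sniff_header(fh.read(4)))
```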

Exemplar deletion requests