Commons:Batch uploading/US National Archives


US National Archives

I plan to use a bot to upload images from the US National Archives' digital files. I currently have access to a cache of over 120,000 TIFF master files which are ready for upload. The bot is a custom pywikipediabot script written by Multichill (code) and it relies on slakr's toolserver tool to translate NARA metadata into Commons upload code. It will upload images using the custom {{NARA-image-full}}. Each page will be uploaded with that template filled out with the imported NARA metadata, plus {{Uncategorized-NARA}} to facilitate the categorization of these files. Dominic (talk) 19:15, 20 July 2011 (UTC)


Moved from Commons:Bots/Requests/US National Archives bot.
I wrote a bot to do the uploads. I added the link to the source. Multichill (talk) 19:54, 19 July 2011 (UTC)

Comment For photographs, like 3 example uploads, I would suggest looking into a way to add more categories:
  • Author category
  • Date category
  • Subject category
  • Medium category (photographs, paintings, handwritten documents, etc.)
  • etc.
For other types of records other category types might be suitable. It is easier to add some of those categories before the upload. --Jarekt (talk) 01:43, 20 July 2011 (UTC)
I'm not sure how we could do any of these in an automated way. Not all documents have subjects, and the ones that do do not map onto Commons categories anyway. The same is true of the medium and author fields. The dates also seem difficult. Some of the dates are ranges, some are exact days, just months, or just years. Dates can represent dates of creation, copyright, publication, or broadcast. I am hoping we will be able to organize a major community effort for categorizing these, as it will take humans. The one thing that we can do is categorize them hierarchically according to the National Archives catalog structure. For example, each of the Ansel Adams items would go in a category for the "Ansel Adams Photographs of National Parks and Monuments, compiled 1941 - 1942, documenting the period ca. 1933 - 1942" series. Of course, some of the series are less descriptive than others, but it's a start. Dominic (talk) 03:00, 20 July 2011 (UTC)
I think we should try 2 approaches. First, add categories based on the NARA catalog structure. We could make them hidden categories and encourage people to move images out of them, but this way we can group similar images together. Second, I still think that we should try to match NARA authors with Commons creators and add appropriate categories. In my WGA upload all images have a Creator template and a matching author category. Maybe a way to accomplish that would be to create a translation table where each NARA author is matched with a creator and category. Then your bot would read this table and use it to add the proper templates and categories. The table can be easily added to the bot if it is implemented as an external CSV file. We probably do not need to match every NARA author, since some might be quite obscure, but we should at least match all authors that already have a creator template and authors with a large number of records. Dominic, do you think it would be possible to put somewhere a list of all authors of the files you are planning to upload and how many records are associated with them? I can try to see how many I can match. --Jarekt (talk) 16:01, 22 July 2011 (UTC)
This is what I did. I made {{NARA-Author}} for all of the authors. Every author (or person listed as a "contributor", whether it's a photographer, artist, director, etc.) has an ID and a page in the catalog that links to the records they are associated with. That template creates a URL to these author records in the catalog. I am not sure if that helps or hinders the attempt to make categories for them, but maybe we can use the template in some way to add categories based on those unique IDs? I will note, though, that it's actually uncommon for authors to be listed at all. Most documents are created by uncredited federal workers, and others are grouped into series based on the author, but the author field in the record isn't actually used (cf. this series). The full list of author records could actually be extracted from the dataset, if anyone is brave enough to try. Dominic (talk) 16:18, 22 July 2011 (UTC)
I did not notice {{NARA-Author}} before. If it is added to all the images that have an author, then we can easily add creator templates and categories later. BTW, I did not see author records in the NARA dataset or its description. --Jarekt (talk) 17:00, 22 July 2011 (UTC)
I do not know if there are separate XML files for the person authority records, like there are for items. However, if an item has a contributor mentioned in its record, the contributor's ID is also there in a field in the item's data file. This is how I am able to upload the files with that information. Dominic (talk) 19:04, 22 July 2011 (UTC)
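
A minimal sketch of the translation table Jarekt proposes above, assuming it is kept as a CSV keyed by the NARA contributor ID that Dominic says is present in each item's data file. The column names, the sample rows, and the helper names `load_author_table` and `author_wikitext` are all hypothetical; this is not part of Multichill's script.

```python
import csv
import io

# Hypothetical translation table: NARA contributor ID -> Commons creator
# template and category. The IDs and names are placeholders, not catalog data.
SAMPLE_TABLE = """nara_id,creator_template,category
1000001,Creator:Example Photographer,Example Photographer
1000002,,Photographs by Example Agency
"""


def load_author_table(csv_text):
    """Return a dict of table rows keyed by NARA contributor ID (as a string)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["nara_id"]: row for row in reader}


def author_wikitext(nara_id, table):
    """Build the extra wikitext for one file; empty string if the author is unmatched."""
    row = table.get(str(nara_id))
    if row is None:
        return ""
    parts = []
    if row["creator_template"]:
        parts.append("{{%s}}" % row["creator_template"])
    if row["category"]:
        parts.append("[[Category:%s]]" % row["category"])
    return "\n".join(parts)
```

Unmatched authors simply produce no extra wikitext, which fits the suggestion that obscure authors need not be matched at all.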

As a test run, I have gone ahead and finished the Ansel Adams batch (220 files). [1] Dominic (talk) 04:45, 20 July 2011 (UTC)

End of move. Multichill (talk) 19:26, 20 July 2011 (UTC) I moved the discussion to here from Commons:Bots/Requests/US National Archives bot. We have two pages:

Why did I make this split? Because bot requests take ages when we start discussing batch requests and a request gets closed when we actually want to provide more feedback. Can everyone please respect this? Multichill (talk) 19:26, 20 July 2011 (UTC)

ARC number

Another solution for ARC could be to store them all in a separate template page, so that series ARC=408 would give "Record group 79: Records of the National Park Service, 1785 - 2006 (ARC identifier: 408)". This would make the page description more concise and would also allow adding translations of record group and series names by editing only one page.--Zolo (talk) 01:51, 28 July 2011 (UTC)

We could even store more data than that in the template so that we would only need to provide the document ARC in the file description. This would not be as efficient, but it would minimize duplicate info and would provide cleaner, potentially reusable data. Additionally, this would make file description even easier by hiding away info that in most cases should not be changed by users. I have created a toy template in {{ARC/sandbox2}}. {{ARC/sandbox2|306514}} gives

This media is available in the holdings of the National Archives and Records Administration, cataloged under the National Archives Identifier (NAID) 306514.

This tag does not indicate the copyright status of the attached work. A normal copyright tag is still required. See Commons:Licensing for more information.

  • Record group: Committee Papers, compiled 1806 - 2000 (: 306513)
  • Series: 128: Records of Joint Committees of Congress, 1789 - 2004 (: 457)

This means that {{ARC/data}} will need to be quite large. To make it smaller, it could also be used for record groups and series only, and not for individual documents. But it would make it less useful.--Zolo (talk) 03:29, 29 July 2011 (UTC)

ParserFunctions are a bit beyond me, but could that possibly work with tens of thousands of records? Dominic (talk) 12:54, 29 July 2011 (UTC)
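
As a rough illustration of what such a data template might contain, here is a sketch that renders a `{{#switch:}}` body from a mapping of ARC identifiers to labels. The function name `build_switch` and the layout of the generated wikitext are assumptions; the two sample labels are taken from this discussion. It does not answer whether a switch with tens of thousands of cases would perform acceptably, which would need testing on the wiki.

```python
def build_switch(records):
    """Render a {{#switch:}} body mapping ARC identifiers to descriptive labels.

    `records` is a dict of {arc_id (int): label (str)}. Labels containing a
    pipe character would need escaping, which this sketch does not attempt.
    """
    lines = ["{{#switch: {{{1|}}}"]
    for arc_id in sorted(records):
        lines.append("| %d = %s" % (arc_id, records[arc_id]))
    lines.append("| #default = ")  # unknown identifiers render as empty text
    lines.append("}}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_switch({
        408: "Record group 79: Records of the National Park Service, 1785 - 2006",
        306513: "Committee Papers, compiled 1806 - 2000",
    }))
```

Generating the template text from the dataset, rather than editing it by hand, would at least keep a very large switch maintainable.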


  • I think it would be useful for our users to have on each page a link to the relevant "Scope & Content" page of the photographic series the picture belongs to on the ARC website. These "Scope & Contents" pages contain valuable information on the origins of the pictures. As they are 2 clicks away from Wikimedia Commons, among a number of not-so-useful links, I think most users won't find them if we don't provide a direct link (We might also copy them to wikisource and link to the corresponding wikisource pages. We might copy them here on Commons if we get community approval for using gallery pages for that purpose). So for example, on this file it would be good to have the following : "Series: Signal Corps Photographs of American Military Activity, compiled 1754 - 1954 (Scope & Content)". I think "Scope & Content" is more important, for a first reading, than "Details". The "record ID" and "Source" fields should be merged and called "Source". Teofilo (talk) 23:21, 1 August 2011 (UTC).
    I changed my mind. I feel more like removing all the Record group, Series, NAIL Control Number information. The {{NARA-image}} template with its single arcweb link is enough. The users who want to know more can click on that single link which is an entrance to all the extra information. The "Record ID" field is not useful save the Nara-image template. Teofilo (talk) 08:44, 2 August 2011 (UTC)
    I doubt you'll find anyone agreeing with that point of view. And note that the ARC ID is more just the identifier that refers to the catalog record and allows us to make predictable URLs. The series and record group are actually descriptive metadata assigned by the archives that relate to the document creator and/or subject. Dominic (talk) 04:50, 5 August 2011 (UTC)
    I am afraid you are swapping the parts. Until now hardly any upload from the NARA was made by including those extravagant and noisy data which are not useful to a majority of users. You will find hardly anyone among those who uploaded contents from NARA in the past who agrees with you. For example File:USS Intrepid (CV-11) - Nov 44 a.jpg. That these extra data are not useful is common sense. For example, let's see how the Bundesarchiv pictures are documented. In the case of File:Bundesarchiv Bild 101I-731-0388-38, Frankreich, nach der Invasion, Infanteristen.jpg, all the extra information such as
  • Inventory: Bild 101 I - Propagandakompanien der Wehrmacht - Heer und Luftwaffe
  • Classification: Sachklassifikation/E {Zweiter Weltkrieg 1939-1945}/Ee {Kriegsschauplätze und Feldzüge}/Ee 300 {Westfeldzug}/Ee 350 / 360 / 370 / 380 {Frankreich*}/Ee 380 {Frankreich nach der Invasion (ab 6.6.1944)}/Ee 381 {Infanterie} Sachklassifikation/E {Zweiter Weltkrieg 1939-1945}/Ed {Truppen- und Formationsgeschichte*}/Ed 100 / 200 {Heer*}/Ed 110 {Infanterie}
was removed. Removing is the right thing to do. Please note also that the creator template was made collapsible because a lot of people found it too noisy. There is wide support for the idea of keeping description pages streamlined and simple. Teofilo (talk) 09:24, 5 August 2011 (UTC)
  • Each page contains 2 links to en:U.S. National Archives and Records Administration. I think this is one too many (or two too many if you count commons:National Archives and Records Administration). Couldn't we just get rid of the "Current location" field altogether? Isn't the {{NARA-image}} template sufficient to mean that the pictures are located there ? Teofilo (talk) 23:02, 1 August 2011 (UTC)
    • NARA is a major US government agency with more than two dozen facilities. It's not a location. The location field is the record of where the physical document digitized on Commons is located. That the institution's name is linked more than once is because there are three separate templates used on the pages that are complete; it seems pretty trivial. Dominic (talk) 20:12, 2 August 2011 (UTC)
      Brainwashing the user by repeating three times the same message is an advertising technique amounting to using Wikimedia for a promotional campaign at the expense of usability. It overcrowds the template and makes the other information such as the author, date, or description fields proportionately less visible. The reason why the Artwork template contains both a "location" field and a "source" field is that we are dealing with photographs of paintings and photographs of sculptures. The "location" field is for the location of the painting/sculpture, while the source field is for the source of the photograph. For this reason, NARA uploads of paintings such as File:"Crocodile and Snake Fighting" - NARA - 558928.tif are wrong. The "location" field should be filled with "unknown", or with the name of the museum or of the private owner who owns the painting. Writing "National Archives and Records Administration, Still Picture Records Section, Special Media Archives Services Division (NWCS-S)" in the "current location" field of this painting is a mistake (for example compare with File:Serapis Louvre AO1027 profil.jpg, and count the number of occurrences of the word "Louvre" there). For works that are just photographs, not photographs of paintings or photographs of sculptures, the "location" field should be removed. Teofilo (talk) 09:24, 5 August 2011 (UTC)
      These files are the records of a government agency, and the location field is the listing of the repository in which the records are held. That is not extraneous or unusual information. Your accusations of brainwashing and advertising are getting tiresome. The institution you are talking about is a public agency that holds public records; it is graciously making its high-res scans available to Commons with no strings attached. The "advertising" you are talking about is metadata added and maintained by Wikimedians because it is useful. Nothing of the sort has been demanded or even asked by the institution you are maligning. Dominic (talk) 16:13, 11 August 2011 (UTC)

File name maximum length and file name cutting format

The following is copied from Commons:Administrators' noticeboard/Blocks and protections#User:US National Archives bot

I think the bot should be blocked until the file-name issue is solved. See the "File:Combat memorable..." entry in Commons:National Archives and Records Administration/Error reporting or compare this NARA upload (name cut after "Gene") with previously uploaded picture with full name. Look at this list of 50 uploaded files where most of the file names are cut. It is not realistic to correct all these file name errors afterwards one by one, tagging each picture with {{Rename}}. The upload software bug must be solved so that the files are uploaded with the full name, without cut. Cut names not only give users an impression of bad quality, they also create a lot of potential wrong keyword matches in search engines. Someone looking for a "gene" (a biological system) should not find the "Alphonse Juin, Commanding Gene" picture in his search results. Teofilo (talk) 22:23, 30 July 2011 (UTC)

Er, you want it blocked? I can just turn it off, you know. I'm not exactly sure what the issue is, though. The titles get cut off when they reach the length limit. "The upload software bug must be solved so that the files are uploaded with the full name, without cut" is an impossible solution. This doesn't seem like a huge problem, certainly not one that's more important than getting the content on Commons. Most end users are going to be viewing the images on the projects, so the idea that these titles somehow negatively affect users because they are stylistically displeasing is a little baffling to me. Dominic (talk) 23:29, 30 July 2011 (UTC)
Oh? How come there was no polite enquiry from Teofilo on either Commons:Batch uploading/US National Archives or User talk:Dominic? Oh wait... Jean-Fred (talk) 23:37, 30 July 2011 (UTC)
For Jean-Frédéric, here is the Commons:National Archives and Records Administration/Error reporting link again, where the problem was debated between Dominic and me below the "File:Combat memorable..." entry. Teofilo (talk) 11:11, 31 July 2011 (UTC)
Actually, I posted a fix a couple of days ago for the problem Teofilo mentions. Oddly it hasn't been applied yet. --  Docu  at 05:36, 31 July 2011 (UTC)
Thank you for doing so. I was not aware that you had prepared a fix. Teofilo (talk) 11:11, 31 July 2011 (UTC)
I thought that that was about the dates appended to the end of titles. I don't see where you mentioned the issue Teofilo is concerned about anywhere on the page. Dominic (talk) 19:10, 31 July 2011 (UTC)
Who is(are) the person(s) in charge of the upload software ? According to en:Wikipedia:Naming_conventions_(technical_restrictions)#Title_length, "Titles must be less than 256 bytes long when encoded in UTF-8.". Measured with , File:US Navy 050419-N-5313A-049 A U.S. Marine Corps AV-8B Harrier launches from the flight deck of the amphibious assault ship USS Kearsarge (LHD 3) during flight operations in the Mediterranean Sea.jpg is 202 bytes long and File:Combat memorable donne le 22, 7re 1779, entre le Captaine Pearson commandant le Serapis et Paul Jones commandant le Bonh - NARA - 532895.tif is only 145 bytes long. So it looks possible to add 256-145=111 more characters into NARA uploads' file names. The full title "Combat memorable donne le 22, 7re 1779, entre le Captaine Pearson commandant le Serapis et Paul Jones commandant le Bonhomme Richard et son escadre, 07/22/1779" being 159 characters long, it should be OK. With 249 characters, "Pvt. Jonathan Hoag,...of a chemical battalion, is awarded the Croix de Guerre by General Alphonse Juin, Commanding General of the F.E.C., for courage shown in treatingwounded, even though he, himself, was wounded. Pozzuoli area, Italy.", 03/21/1944" is perhaps only one or two characters longer than the 256 limit after adding "File:" and ".tif". Also it could be decided to cut whole words instead of cutting in the middle of the words, and to use (…) at the location where the cut is performed, like I did for this upload of mine. Perhaps it would be best to always keep the date at the end of the title, and to cut the words located before the date. Teofilo (talk) 12:40, 1 August 2011 (UTC)
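
The byte counts above can be reproduced with a short check like the following sketch. The 255-byte ceiling follows from the quoted rule that titles must be less than 256 bytes in UTF-8; the helper names are hypothetical, and whether the "File:" prefix counts toward the limit is left to the caller, who decides what string to pass in.

```python
MAX_TITLE_BYTES = 255  # "less than 256 bytes", per the naming convention quoted above


def title_bytes(name):
    """Length of a title in bytes when encoded as UTF-8 (not its character count)."""
    return len(name.encode("utf-8"))


def fits(name, limit=MAX_TITLE_BYTES):
    """True if the title is within the byte limit."""
    return title_bytes(name) <= limit


if __name__ == "__main__":
    example = "Combat memorable donné - NARA - 532895.tif"  # illustrative, shortened
    print(title_bytes(example), fits(example))
```

Note that accented characters take more than one byte, so a title's byte length can exceed its character count.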
I am running a script that was written by Multichill; he's not in charge of the bot's actions, but I am not a programmer, so I can't easily make changes without him. I was not originally aware that the character limit was that high. I had thought that the limit was being imposed by the upload form, not by the bot's script, which is why I was saying it wasn't fixable. I see now that we can allow even longer titles, but I am not sure if we should. This should be discussed at Commons:Batch uploading/US National Archives, as the names already seem rather long and unwieldy to me. Your suggestion to not have it cut off titles mid-word, though, is a good one, I agree. In any case, I don't think this is a dealbreaker. The full titles are all contained in the template's "title" parameter, so we wouldn't have to go back and rename anything manually anyway, since a bot can extend the names using that data. I think it is more important to get the files actually uploaded at this point. Dominic (talk) 14:27, 1 August 2011 (UTC)

End of copy from Commons:Administrators' noticeboard/Blocks and protections#User:US National Archives bot

Do you have a deadline after which the files won't be available any longer? File renaming is an activity which consumes a lot of resources and which is generally frowned upon unless there is a good reason to do so. I am afraid the massive file renaming operation will be refused. When there is a problem in a car factory you stop the production line until the problem is solved. You don't sell the cars first and recall them a year later to change the defective part. The latter is more expensive. I think we need more opinions from people with bot software writing experience and help from people who would be willing to actually modify the script or write the file renaming bot's script. I am going to copy the present talk on Commons:Batch uploading/US National Archives. Teofilo (talk) 17:00, 1 August 2011 (UTC)
Well, I am only here for a couple more weeks. The files are not available on the Internet, but on hard drives here in the office. So it wouldn't be wrong to say there is a deadline of sorts. I am not sure the analogy to the factory is appropriate, as we're not recalling anything, just changing a name on a wiki. I'm not even sure if this is important enough that we would want to go back and change past uploads, even if we do change the convention going forward. They are not erroneous, just truncated. Dominic (talk) 17:21, 1 August 2011 (UTC)

For those who don't want to read all that text, the question is whether we want to make use of the full 250 characters we are allowed for the file names, which can be quite long, or whether we want to truncate it at a shorter length. The script is currently truncating at 120 characters, which isn't exactly short either, but does cause a lot of titles to get cut off. Dominic (talk) 17:21, 1 August 2011 (UTC)

I agree the file name issue should be fixed before the next batch of uploads and I think we should keep titles short. Let's concentrate on the issue of how to do it. Dominic, is this still the code you are running? If so, then I assume that the issue is with the "if len(titleText)>120: titleText = titleText[0 : 120]" line. Docu, did you say you posted a fix somewhere? If so, then where? I think we can solve this issue in a timely manner so as not to slow down Dominic too much. --Jarekt (talk) 17:43, 1 August 2011 (UTC)
Yes, that is the code. It seems easy enough to change, except this is more a question of style than a bug in the code, so I'm not sure what change, if any, to apply. (I think Docu is referring to the date issue, not this one, but I am not sure.) Dominic (talk) 17:59, 1 August 2011 (UTC)
The date issue appears on the NARA website too. It is not a simple upload bot script problem, although a script could help remove the extra date. I don't think there might be so many files with the date duplicate issue, so I guess it won't be so bad if we leave that issue unsolved. Teofilo (talk) 18:19, 1 August 2011 (UTC)
I have inquired, and these are actually not errors so much as limitations in the NARA catalog software. That "coverage dates" field, which is used to refer to the dates depicted in the document's subject rather than the document's creation, can only take ranges. When you put in a single day, it still makes it into a range. This isn't something they are going to fix. Dominic (talk) 18:35, 1 August 2011 (UTC)
A few more ideas:
1) Unwieldy ? Of course they are but we are in a situation where we must choose between the less unwieldy of two unwieldy possibilities. The possibility with extra-long names, and the possibility with names cut in an automatic fashion which creates wordings that are at times perfectly meaningless. It should not be forgotten that for a number of users English is a foreign language and it is less obvious when you don't master the language to understand that a sentence was cut and you should not even try to read a meaning. Also we should try as much as possible not to misrepresent the quality of the NARA's work. The NARA's work might have a number of shortcomings, but in any case the NARA does not produce botched file names.
2) While the files with a cut name are, in my opinion, a problem, there is no reason to prevent the bot from uploading all the other files with a short name. One possibility would be to quickly modify the bot script so that the files with long names are avoided for the time being, and to upload them later, after we have decided what to do with them.
3) One option would be to decide the new shorter names manually, on a case by case basis. We would have a bot write all the long file names in the left column of a table, and then we would request Wikimedians to write the shorter names with (…) in the right column. Then when all shorter names are available, the upload bot would be able to pick up the shortened files names from the table. Teofilo (talk) 17:48, 1 August 2011 (UTC)
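
Ideas 2) and 3) could be combined roughly as in the sketch below: short titles go ahead, long ones are set aside and listed in a two-column wikitable for manual shortening. The 120-character limit is the one the script is reported to use; the function names, the `(arc_id, title)` record shape, and the table layout are assumptions made for illustration.

```python
LIMIT = 120  # the cutoff the script currently uses, per the discussion


def partition_titles(records, limit=LIMIT):
    """Split (arc_id, title) pairs into ready-to-upload and deferred lists."""
    ready, deferred = [], []
    for arc_id, title in records:
        if len(title) <= limit:
            ready.append((arc_id, title))
        else:
            deferred.append((arc_id, title))
    return ready, deferred


def deferred_table(deferred):
    """Render deferred titles as a wikitable; the right column is left blank
    so that volunteers can fill in a manually shortened name."""
    rows = ['{| class="wikitable"', "! Full title (ARC) !! Short name"]
    for arc_id, title in deferred:
        rows.append("|-")
        rows.append("| %s (%s) || " % (title, arc_id))
    rows.append("|}")
    return "\n".join(rows)
```

A later bot run would then read the filled-in right column back and upload the deferred files under those names.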
Ensuring that we don't cut off names mid-word will help, as would adding "..." to the end when cut off. Note that even at 250 characters, some titles will be cut off. I am not sure (especially judging by Jarekt's reply) that there is agreement to do that, though. Dominic (talk) 17:59, 1 August 2011 (UTC)
I see 2 possible solutions:
  • Automatic: if the filename is longer than 120 characters, then look for periods, semicolons or commas and trim there. If the string is still longer than 120, then trim at the end of a word. Add ... in the last case, and maybe also when trimming at a comma.
  • Manual: if the filename is longer than 120 characters, then (as Teofilo suggested) skip it for the time being, while writing its ID and title to some log file. Then from time to time read the log file in Excel (or some other spreadsheet) and manually trim the title. Or post the file somewhere, so others can help (Teofilo?). Then alter your bot to allow upload of those specific files with the provided filenames. I should be able to help with this part, if you need help.
The first solution is much less work. So that would be my preference. --Jarekt (talk) 18:49, 1 August 2011 (UTC)
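
The automatic option could look roughly like this sketch. It is one reading of the rule described above, not the code that was eventually deployed: `shorten_title` is a hypothetical name, and the choice to add "..." only in the word-boundary case is an assumption.

```python
def shorten_title(title, limit=120):
    """Shorten a title to at most `limit` characters without cutting mid-word."""
    if len(title) <= limit:
        return title
    head = title[:limit]
    # Prefer the last period, semicolon or comma inside the allowed window.
    cut = max(head.rfind(ch) for ch in ".;,")
    if cut > 0:
        return head[:cut].rstrip()
    # Otherwise fall back to the last complete word and mark the cut.
    ellipsis = "..."
    head = title[:limit - len(ellipsis)]
    if " " in head:
        head = head[:head.rfind(" ")]
    return head.rstrip() + ellipsis


if __name__ == "__main__":
    print(shorten_title("alpha beta, gamma delta epsilon", limit=20))
    print(shorten_title("alpha beta gamma delta epsilon", limit=20))
```

One caveat: a title whose only punctuation in the window is an early abbreviation such as "Pvt." would be trimmed very short, so a minimum-length guard may be worth adding.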
1) If you are patient enough to read 120 characters, why aren't you patient enough to read 256 ? Both the NARA website designers and the Library of Congress website designers have felt normal to require from their users to read titles longer than that. For example the html < title > attribute of is 330 characters long. What is wrong with that ? If the Library of Congress asked you for advice, what advice would you give ? Also, the fact that a title is displayed on your browser page does not mean you have to read the whole of it. If you are tired with reading, you can stop reading and look at some other area of the page.
2) I tagged one of the NARA uploads with {{rename}} diff. The file was renamed today. Here is the result and I think it is much better (although I forgot to include the date). And I don't feel it is too long. If you remove the last part, the dramatic - tragedy - effect meant by the creator is lost. Sometimes titles are pieces of literature, meant to create emotions. Many of these pictures were used for propaganda. The caption was perhaps as important as the scene represented. Teofilo (talk) 22:18, 1 August 2011 (UTC)
3) For people who are unhappy with file names longer than 140 characters (while being shorter than 256 characters) it may be possible to create a JavaScript (or gadget, or full-fledged MediaWiki extension) which automatically cuts the name that is displayed onscreen (with the possibility to read the longer version in a mouseover). Teofilo (talk) 23:41, 1 August 2011 (UTC)

I think you are looking at this entirely the wrong way. Relatively few people are looking at the images on Commons itself, and the ones that are are usually the editors that are maintaining them, not the people using the images. No one is really concerned about a long title looking a little unsightly at the top of a description page. We do, however, have to think about how this is going to be used on the projects, and huge file names make article text hard to read in the edit view and make Wikisource index pages incredibly odd-looking. And for what? You're writing as if the file name, which is clearly marked off with a "File:" and a ".tif" and has other data in it, is the title itself. It may be true that titles are pieces of literature and that they are important, but no one wants to remove the title. There is a title field in the metadata for that, quite apart from the file name. Dominic (talk) 00:15, 2 August 2011 (UTC)

The view that Commons is for Wikipedia is not very popular here. A lot of people insist that Commons should be viewed as a media repository independently of its value for Wikipedia. The file name is also important as being the caption you read when your mouse hovers on a file name below a thumbnail in a category page. Teofilo (talk) 00:39, 2 August 2011 (UTC)
4) I have found the following pictures from a batch upload a (171 B), b (170 B), c (176 B), d (177 B), e (174 B), f (176 B), g (180 B), which probably means the uploader did not find these lengths annoying. Teofilo (talk) 00:39, 2 August 2011 (UTC)

For me a filename needs to meet 2 requirements: be meaningful and be unique. The second part (<20 characters) provides uniqueness, and the first part is trying to be meaningful, and I think 100 characters is plenty to accomplish that. I find long names to be distracting and awkward, and wikitext using them hard to read. However, raising the maximum length of the filename would be by far the simplest way to "fix" the issue. --Jarekt (talk) 03:38, 2 August 2011 (UTC)

In my view, filenames need to be authentic. If Shakespeare called his play "Romeo & Juliet" you can't rename it "Richard & Julia" because you have a personal liking for these names. If some obscure Office of War Information bureaucrat during World War II decided to call a picture "Members of the 6888th Central Postal Directory Battalion take part in a parade ceremony in honor of Joan d'Arc at the marketplace where she was burned at the stake" you cannot change it. The only alternative would be to use a totally cryptic name, like 43-0194a.gif. I don't think there is a middle unauthentic term between a totally cryptic name and the full authentic name. The argument that the full name is written in the "title" field of the template anyway, fails to convince me, because putting an unauthentic name in a more prominent place than the authentic name remains an aggression of authenticity. The choosing of a long caption or name in association with a picture by some administration during World War II is a historical fact. Even if you find that fact distracting or ugly, you can't change it. By the same token, some pictures happen to be ugly. But for authenticity's sake one should not retouch an ugly historical picture to make it look nicer. If a picture has an ugly title, you can't change it either. You can't retouch "Romeo & Juliet". Teofilo (talk) 09:17, 2 August 2011 (UTC)
For this file and this one, key information (location and year) is cut. Teofilo (talk) 16:04, 2 August 2011 (UTC)
It is quite clear by now what your opinion is, Teofilo. What we are looking for is other opinions to see if anyone actually agrees with you. Dominic (talk) 20:17, 2 August 2011 (UTC)
The only absolute criteria for filenames are 1) uniqueness (easily done with the ARC) and 2) length is under the technical limit (easily done by truncation). All other considerations are cosmetic, as the full metadata is listed in the info template. The filename is just a key for the file database: it doesn't have to contain a perfect description of the image, most files at Commons don't. To be honest, we could call all images "NARA image - ARC 123456.tiff" and be done with it. So I don't think it matters where we chop the description. I'd lean towards shorter, as long filenames can be a pain at Wikisource (we have the full name in the Page: namespace, for example), but that is a minor gripe. The metadata will always be in the info area, and only the ARC is required to uniquely identify the image. So, I'd say truncate at whatever is most convenient. Inductiveload (talk) 23:29, 2 August 2011 (UTC)

This file name cut removed the most important part: Captain Harry Truman. Teofilo (talk) 21:40, 3 August 2011 (UTC)

Teofilo, you provided dozens of examples of trimmed filenames. However, to me the only issue with those is that they are too long. I agree with Inductiveload that "filename is just a key for the file database" and that descriptions can be found inside file descriptions. --Jarekt (talk) 02:39, 4 August 2011 (UTC)
You wrote "I agree the file name issue should be fixed" above on this page on 1 August (diff). If you agree with Inductiveload that "filename is just a key for the file database", what is the issue which you want to fix ? Or have you changed your mind since 1 August ? Teofilo (talk) 12:58, 4 August 2011 (UTC)
Note that truncated names now only terminate at the end of complete words and include a "..." when there is any truncation. Dominic (talk) 04:31, 4 August 2011 (UTC)

File matching tool

I think we need a developer for the development of a file matching tool. That tool would use an interface similar to that of Cat-a-lot, with the possibility to select two files from a gallery page. Then the tool would

  • add the |Other version field in both files
  • pick up the categories from the older file and add them into the newer file (and vice-versa) Teofilo (talk) 12:08, 5 August 2011 (UTC)
This does not make sense to me. What gallery page? How will non-identical versions be detected by a bot? The eventual plan is to add JPG/DjVu versions of all these files by bot, so they will all have a linked file in "Other versions" that will be usable on the projects at some point. Dominic (talk) 16:13, 11 August 2011 (UTC)

Author information retrieving bot

We need a bot to explore systematically all pages similar to in order to retrieve author information. At present such author information is not provided by the upload bot. Perhaps it is simpler to do this separately with another bot. I think I am personally getting tired of adding this information manually (for example, see this diff). Teofilo (talk) 12:08, 5 August 2011 (UTC)

Those are not structured pages and I see no way for a bot to extract author information from them. There are some tasks that simply require a human. Dominic (talk) 15:43, 5 August 2011 (UTC)
All captions from (example: "Danny Kaye, well known stage and screen star, entertains 4,000 5th Marine Div. occupation troops at Sasebo, Japan. The crude sign across the front of the stage says: `Officers keep out! Enlisted men's country.'" Pfc. H. J. Grimm, October 25, 1945. 127-N-138204) and similar pages should be extracted (by a bot or human) and put into the left column of a table. Then a bot should say whether the file was uploaded on Commons or not, and if so, provide a link to the file uploaded on Commons, and say whether the |author= field is still empty. Then humans could pick up the author name from the full caption. This would ensure that this is done in a systematic way, and that no chance to find author names is missed. Teofilo (talk) 15:30, 6 August 2011 (UTC)

Actually a bot could compare the string of characters in the full caption at and the string of characters in the |title= field on Commons. For example, comparing ["Danny Kaye, well known stage and screen star, entertains 4,000 5th Marine Div. occupation troops at Sasebo, Japan. The crude sign across the front of the stage says: `Officers keep out! Enlisted men's country.'" Pfc. H. J. Grimm, October 25, 1945. 127-N-138204] with [|Title=Danny Kaye, well known stage and screen star, entertains 4,000 5th Marine Division occupation troops at Sasebo, Japan. The crude sign across the front of the stage says: "Officers keep out! Enlisted men's country."] would reveal that "Pfc. H. J. Grimm, October 25, 1945. 127-N-138204" was left out. After all left out parts are neatly listed in a table by a bot, humans could try to figure out what they can do with them. Teofilo (talk) 15:44, 6 August 2011 (UTC)
I think you missed the point. How do you know what to compare? You have a Commons image file, and then you have a string of characters on a random webpage. If a human has to find and point the script to the line on the page that has the information, it kind of defeats the purpose. Dominic (talk) 17:10, 8 August 2011 (UTC)
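
For illustration, the comparison step proposed above can be sketched once a caption and a title have already been paired, which is exactly the part Dominic points out is unsolved. This is only a sketch using Python's standard difflib; the function name left_out_tail is hypothetical, and a fuzzy match is used because the two strings differ slightly ("Div." versus "Division", different quotation marks).

```python
from difflib import SequenceMatcher


def left_out_tail(caption: str, title: str) -> str:
    """Return the trailing part of caption that has no counterpart in title."""
    matcher = SequenceMatcher(None, caption, title, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return caption.strip()  # nothing in common: the whole caption is left over
    last = blocks[-1]
    tail = caption[last.a + last.size:]  # everything after the last shared run
    return tail.lstrip(" '\"`").strip()  # drop stray quote marks and spaces


caption = (
    "\"Danny Kaye, well known stage and screen star, entertains 4,000 "
    "5th Marine Div. occupation troops at Sasebo, Japan. The crude sign "
    "across the front of the stage says: `Officers keep out! Enlisted "
    "men's country.'\" Pfc. H. J. Grimm, October 25, 1945. 127-N-138204"
)
title = (
    "Danny Kaye, well known stage and screen star, entertains 4,000 "
    "5th Marine Division occupation troops at Sasebo, Japan. The crude "
    "sign across the front of the stage says: \"Officers keep out! "
    "Enlisted men's country.\""
)
print(left_out_tail(caption, title))
```

The sketch only reports text left over at the end of the caption. Deciding which part of that remainder is an author name would still be a job for a human.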

Categorizing progress statistics software

[concerning Commons:National Archives and Records Administration/Categorize/Progress ]


Would it be possible for BernsteinBot to compile more data? At present the "categorized" column on, for example, this page only provides a boolean "categorized" YES/NO parameter. Would it be possible to retrieve the number of added categories and to calculate the percentage of files with 2 or more categories, with 3 or more categories, etc.? Especially if the number of categories is only one, I consider that the job is not finished. Files should have at least 2 or 3 categories, in most cases. It would be good to have a way to find the files with only one category, so that people can quickly go to those files to finish the job. Teofilo (talk) 12:58, 1 August 2011 (UTC)

The above is a copy of a message I left on the talk page of BernsteinBot's owner. Teofilo (talk) 12:12, 5 August 2011 (UTC)

I think we also need statistics to check whether the |author field has been completed or is still left blank. Teofilo (talk) 12:19, 5 August 2011 (UTC)

It operates based on normal Commons procedure. Files are either uncategorized or they're not. I don't see much evidence for your opinion that files with only one category are "unfinished". It would be nice to collect some of these statistics for measuring outcomes, but I'm not convinced it would be very useful (or very much used) by people categorizing. It's certainly not a pressing need. Dominic (talk) 17:05, 8 August 2011 (UTC)
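
As a rough illustration of the statistics requested in this thread, the sketch below computes the share of files with at least 1, 2, or 3 categories and lists the files that have at most one. It does not reflect how BernsteinBot works: the function name and the input format (a mapping from file name to its topical categories, with maintenance categories already excluded) are assumptions.

```python
def category_stats(file_categories, thresholds=(1, 2, 3)):
    """Summarize how many topical categories each file has.

    file_categories: dict mapping a file name to a list of category names
    (hypothetical input; maintenance categories are assumed to be removed).
    Returns (shares, at_most_one): shares maps each threshold to the
    percentage of files with at least that many categories; at_most_one
    is a sorted list of files with zero or one category.
    """
    counts = {name: len(set(cats)) for name, cats in file_categories.items()}
    total = len(counts)
    shares = {}
    for t in thresholds:
        reached = sum(1 for c in counts.values() if c >= t)
        shares[t] = 100.0 * reached / total if total else 0.0
    at_most_one = sorted(name for name, c in counts.items() if c <= 1)
    return shares, at_most_one


sample = {
    "File:A.tif": [],
    "File:B.tif": ["Ansel Adams"],
    "File:C.tif": ["Ansel Adams", "Grand Canyon"],
    "File:D.tif": ["Ansel Adams", "Grand Canyon", "1941 photographs"],
}
shares, at_most_one = category_stats(sample)
print(shares)
print(at_most_one)
```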

Using en language templates

Dunno if there'll be any further bot edits to the already uploaded images, but I guess there will. So if there is a chance, could someone please add {{en|…}} around the descriptions (title and general notes)? I'm a bit surprised that this (apparently) didn't happen already on upload. Using the template would make future translations a bit easier, and is generally recommended here on Commons for internationalization issues (even if it's only regarded as helpful for users who don't speak English, to allow quick and easy identification of the language used). Many thanks in advance --:bdk: 14:32, 20 August 2011 (UTC)
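
A minimal sketch of the wrapping step requested above, assuming the bot already has the title or general-notes text as a plain string. The function name wrap_en is hypothetical. The explicit "1=" is used because an equals sign inside the text would otherwise be read as a named template parameter.

```python
def wrap_en(text: str) -> str:
    """Wrap a description in the {{en}} language template, if not already wrapped."""
    stripped = text.strip()
    if not stripped or stripped.lower().startswith("{{en|"):
        return text  # leave empty or already-wrapped values untouched
    return "{{en|1=" + stripped + "}}"


print(wrap_en("Danny Kaye entertains occupation troops at Sasebo, Japan."))
```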


This page is getting very unwieldy. I am going to be marking and collapsing threads that seem to be resolved so that it is easier to navigate the page and see what needs to be addressed. If anyone feels that I have erroneously marked something as resolved, please feel free to uncollapse it and say so. Dominic (talk) 17:45, 11 August 2011 (UTC)

I marked general questions of categorization as resolved, as we have developed a process for assisting editors in categorizing. Every image uploaded is given {{Uncategorized-NARA}}, which places it in Category:Media contributed by the National Archives and Records Administration. Each file is also automatically placed in a category for its NARA series. We have an automatically updated project page at Commons:National Archives and Records Administration/Categorize/Progress where Commons editors can see the progress of per-series categorizing and navigate down to a list of individual images that need categorizing. In this way, hopefully adding topical categories for all files will be manageable. Dominic (talk) 18:43, 11 August 2011 (UTC)
Open issues

I am trying to summarize the issues that are in any way open, so we can bring some closure to this and the uploading can be completely above board.

  1. Can we automatically match NARA author data with Creator: templates and categories on Commons? — I'd like to work on this, but it can be done within the template, so it doesn't need to block uploads.
  2. Do we want to move the "NARA - <ID> - " part of file names to the front? — It will stay as is unless we hear from more people that they want this.
  3. Storing metadata on a separate template. — I wasn't entirely sure how useful or even possible this is, so I have left it alone in case others have thoughts.
  4. Teofilo's requests:

It seems to me that all of these fall into the category of things that can be worked on during/after the actual upload of files, with the possible exception of the file name lengths. However, that and several others either do not seem very well supported or thought out. New comments, even if it's just simple agreement or disagreement, would help clarify the level of support. Dominic (talk) 19:09, 11 August 2011 (UTC)

Uploaded | Progress | Recent uploads | Category
199,995 | 81% | Gallery | Category:Media contributed by the National Archives and Records Administration

81% completed (estimate)


Assigned to | Progress | Bot name | Category
Dominic | 81% | US National Archives bot | Category:Media contributed by the National Archives and Records Administration