Commons talk:Batch uploading/Geograph

From Wikimedia Commons, the free media repository

database of locations

I have a database of locations (including lat/lons) and was wondering if the methodology you apply here with these photos is somewhat akin to what I need to do to batch create geotagged wiki pages in my wiki which has the google map extension enabled?

Kim --IP 10 August 2010

Really long filenames

Are you going to continue using the extended filenames with geograph.co.uk in them? I never did get why this was necessary the first time around. Jarry1250 (talk) 12:39, 17 August 2010 (UTC)

Talk page is the place for comments

Things seem to have got a little confused here: everyone is using the project page for comments. I will have a go at straightening it out over the weekend if no one objects. --ClemRutter (talk) 21:53, 17 March 2011 (UTC)

As there have been no objections, I have copied across the discussions and will now rebuild the project page. (I haven't copied over the history, as I am in a quandary as to whether it would be the best thing to do.) --ClemRutter (talk) 21:54, 3 April 2011 (UTC)

Discussions

Geograph

In the Village pump Perry Rimmer brought up the suggestion of copying all these files to Commons. Geograph is a site containing about 1.5 million {{cc-by-sa-2.0}} images of the British Isles. The Isles are divided into 1 km by 1 km squares and the goal of the project is to get at least one photo of every square. 250,397 grid squares, or 75.5% of all squares, currently have an image. Most of the images we use at the English Wikipedia to illustrate villages in the United Kingdom come from this site. The quality of the images is not that high, but nevertheless this is a very rich resource. Dumps of the databases are available and also torrents containing the files. I will contact the people behind this project to see if we can make some sort of cooperation project of it. Before I start actually uploading images I want to do several things:

  • Build category trees like Category:Towns and villages in England based on enwp and the list at Geograph
    I built the village/town tree for the UK and Ireland.
    Category trees for subjects (like "bridges") still have to be built.
  • Populate these trees with the current images
  • Clean up the current uploads
  • Wait for more disk space to arrive

Imported the database dump at the toolserver. It should be straightforward to extract all information from the database. Categorization, on the other hand, is probably going to be a nice challenge. Found several possible tools

- Multichill (talk) 16:44, 19 November 2009 (UTC) (more to come)

Update December 2009

I downloaded the first 250,000 images (about 25 GB) and the database dump. With these two combined it's quite easy to generate descriptions and filenames. I modeled this after geograph_org2commons as everybody seems to be happy with that. Categorization, on the other hand, is hard. I take the following approach:

  • Get locations from http://ws.geonames.org/extendedFindNearby
    • I mapped some IDs (earth/Europe/countries/counties/etc) to a Commons category in a database. I first take a look in this database
    • If the ID is not in the database I'll take a look if I can find a category at Commons with a similar name
    • I now have a bunch of location categories at Commons
  • Get the topic from Geograph
    • I mapped Geograph categories (=imageclasses) to Commons categories. See if the Geograph category is in the database
    • If the Geograph category is not in the database look if you can find a category at Commons with a similar name
  • Combine location and topic: Try to find categories deeper in the tree. So for example not Category:Churches, but Category:Churches in England
  • Filter the categories
    • Follow redirects
    • Remove disambiguation categories
    • Filter out overcategorization
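The approach above can be sketched roughly as follows. This is only an illustration of the described steps, not the bot's actual source: the function names, the in-memory dictionaries standing in for the mapping database, and the "in"/"of" naming guess are all assumptions.

```python
from typing import Dict, Iterable, List, Optional, Set


def resolve(key: str, mapping: Dict[str, str], existing: Set[str]) -> Optional[str]:
    """Use the manual mapping first, then fall back to a category with a similar name."""
    if key in mapping:
        return mapping[key]
    candidate = "Category:" + key[:1].upper() + key[1:]
    return candidate if candidate in existing else None


def combine(topic: str, locations: List[str], existing: Set[str]) -> Optional[str]:
    """Return an existing 'Topic in/of Location' category, trying the most specific location first."""
    topic_name = topic.split(":", 1)[1]
    for location in locations:  # assumed to be ordered from most to least specific
        location_name = location.split(":", 1)[1]
        for joiner in ("in", "of"):
            candidate = f"Category:{topic_name} {joiner} {location_name}"
            if candidate in existing:
                return candidate
    return None


def filter_categories(
    cats: Iterable[str],
    redirects: Dict[str, str],
    disambig: Set[str],
    parents: Dict[str, List[str]],
) -> List[str]:
    """Follow redirects, drop disambiguation categories, and remove overcategorization."""
    resolved: List[str] = []
    for cat in cats:
        cat = redirects.get(cat, cat)
        if cat not in disambig and cat not in resolved:
            resolved.append(cat)

    def ancestors(cat: str) -> Set[str]:
        # Walk up the (hypothetical) parent map and collect every ancestor.
        seen: Set[str] = set()
        stack = list(parents.get(cat, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(parents.get(parent, ()))
        return seen

    redundant: Set[str] = set()
    for cat in resolved:
        redundant |= ancestors(cat)
    return [cat for cat in resolved if cat not in redundant]
```

Note that the overcategorization filter only works where the parent links actually exist, which is why a broken tree lets redundant categories through.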

This seems to work alright, but for now I'll add {{Check categories-Geograph}} to the images to be sure. Some issues I expect to encounter:

  • Disambiguation problems. Some images will end up in strange categories because a lot of names aren't properly disambiguated
  • Not properly filtering out overcategorization because the tree is broken. For example we have a lot of Topic in Europe categories, but Topic in the United Kingdom is not a subcategory
  • Some categories will get crowded because the tree for the United Kingdom and Ireland hasn't been built yet

The source is available (work in progress!). Oh, and btw, the usual tricks apply so filenames get cleaned up and no duplicates are uploaded. Multichill (talk) 18:06, 3 December 2009 (UTC)

Did a test upload of 365 files. Feedback is appreciated. Multichill (talk) 23:24, 3 December 2009 (UTC)

This is good news, I think it will bring many interesting images. I looked at the ones here. Just a few points for now:
  1. "in Europe": #398 is in both "Buildings in Europe" and "Buildings in Dorset" (and also "Buildings in England"), possibly because Category:Buildings in Dorset is not in "Buildings in England". #256 is in both "Churches in Europe" and "Churches in Hampshire", despite the latter being a subcategory of the former.
    Yes, a lot of trees are incomplete. These will show up. Already fixed a couple. Please fix the tree if you spot problems like this.
    Will do.
  2. Dot: When the image title doesn't include a "." at the end, one needs to be added when combined for the description (samples: #256, #106). If there is one in the title, it doesn't necessarily need to appear in the file name (#38).
    This is something so minor, i'll just keep it this way
    A minor fix, but could you look at #106? "A view from East Cliff Beach across to Charmouth and Stonebarrow Hill This image is taken from the last concrete groyne" just looks odd.
    Oh right, now I understand you. Thought you were talking about the title, but you're talking about the description. Will fix. (note: strip(), if not last char . -> add char). Multichill (talk) 17:50, 4 December 2009 (UTC)
  3. Filename: I'd limit the "geograph.org.uk - 38" to something like "GG0000038". I don't see much benefit to use the domain name there. I was going to suggest to add the date to the file name, but for some of the files, this seems to be "unknown". 00000000 could be an option for these. Combined with the number this could be "GG20091204-2113138" for a file of today.
    I modeled the filename after Magnus' tool. I like it because it prevents name collisions and is easy to understand
    The full domain name seems excessive and I think the date should be in there. For the Navy pictures, this allows easy sorting.
  4. #256 is in category:Ibsley instead of Harbridge, but I figured out why.
    Yup, the location tool is not right all the time.
  5. #106 is in Category:Coasts and #36 in Category:Hills. Such general categories might fill up quite quickly.
    For 106 this happened because the location tool didn't return a suitable location (Coasts in the English Channel doesn't sound very suitable to me)
    For 36 an intersection between Category:Hills and Category:Isle of Man should be made. Creating these kinds of categories will prevent the main categories from filling up
  6. Template: The layout of {{geograph}} could need some work, but this isn't really related to your upload. I already made a request to remove the interwiki from the template (#iw)
    I would like to have a similar layout as {{Fotothek-License}} or {{KIT-license}}, but let's discuss that at Template_talk:Geograph
  7. Stray text: Some images still have the "Importing image file" text (e.g. #278)
    Yeah, noticed that too. Something went wrong with the import. Removed it from most of the files.
  8. Headers: As the headers are optional, I think we should drop at least {{int:filedesc}} .
    I like to add them for the non-English speakers
    "Description" is translated too, so "Summary" isn't really needed.
  9. The images seem to be fairly old, maybe it's worth doing a test with more recent ones.
    Despite this list, I think the overall quality of the import is good. It is likely to give quite a lot of categorization to do. -- User:Docu at 06:16, 4 December 2009 (UTC)
    That's right. I started with the oldest images and work my way to the newer images. I'll add some more manual categorization mappings. If trees are built and corrected for the most used categories (like Churches) before I compile my batches it will save a lot of time. I'll make a list of important categories to work on. Multichill (talk) 10:45, 4 December 2009 (UTC)
    I replied above. -- User:Docu at 17:00, 4 December 2009 (UTC)
    At User:Multichill/Geograph/categories I put a list of categories. This is based on the 1.5M files in the database. This covers about 60% of the files. I'm working on raising this to at least 80% (will update the list accordingly). For all these categories the trees should be checked and built. The layers to check:
    Sometimes a topic in Europe category exists (for example Category:Churches in Europe). The country categories should be made a subcategory of this. Generally all the topic by location categories should have two or more parent categories. If not, there's probably something missing. Multichill (talk) 13:27, 4 December 2009 (UTC)
    Given the sheer number of images, it might be worth making county categories for topics that otherwise might not be categorized that way. This until we have subcategories for specific features or structures. Your bot, is it already set to make them? BTW could you add "heading:?" to the coordinates? -- User:Docu at 17:00, 4 December 2009 (UTC)
    Most of the categories in my list are already divided by country. It's more about the lower layers. I don't have a bot to create these categories automagically.
    Heading is added when it's known, see for example File:Aldershot - Home of the British Army - geograph.org.uk - 177.jpg. Multichill (talk) 17:50, 4 December 2009 (UTC)
    I made a matrix of categories here Commons:Batch uploading/Geograph/cat-matrix. So all red categories should be created? If they have the right in/of :-) --MGA73 (talk) 19:07, 4 December 2009 (UTC)
    Yes, for many probably also the subcategories for counties. Keep in mind that the final count could easily be 2-4 times the quantity listed. -- User:Docu at 05:46, 5 December 2009 (UTC)
    All done now. Maybe Multichill can get a bot to push existing images down in the new tree? --MGA73 (talk) 21:56, 7 December 2009 (UTC)
    Or maybe someone else can write it ;-) Maybe something in combination with {{Populate category}}. If the image is in both parent categories, move it to the underlying category. Have to think about that. What to do if it has more than 2 parent categories? Etc. Multichill (talk) 17:16, 8 December 2009 (UTC)
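The "push down" idea in the last reply could look something like the sketch below. It is a hypothetical illustration only: the `push_down` function and the `intersections` table (mapping a set of parent categories to their intersection category) are assumptions, and no such bot is described on this page.

```python
from typing import Dict, FrozenSet, Iterable, List


def push_down(file_cats: Iterable[str], intersections: Dict[FrozenSet[str], str]) -> List[str]:
    """Replace a complete set of parent categories with their intersection category."""
    cats = list(file_cats)
    for parents, child in intersections.items():
        # Only act when the file is in every parent of this intersection.
        if parents <= set(cats):
            cats = [cat for cat in cats if cat not in parents]
            if child not in cats:
                cats.append(child)
    return cats


intersections = {
    frozenset({"Category:Beaches", "Category:Cornwall"}): "Category:Beaches of Cornwall",
}
print(push_down(["Category:Beaches", "Category:Cornwall", "Category:Sunsets"], intersections))
```

Because each match removes its parents, a file that qualifies for two intersections sharing a parent would only be moved into the first one. That is one form of the open question about more than two parent categories, and the sketch does not resolve it.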

<unindent>To make sure they get categorized, I made five empty categories for Llyns (Special:PrefixIndex/Category:Llyns). Would these work for your bot that way? If I redirect them to corresponding lake categories, would that work too? -- User:Docu at 10:55, 6 December 2009 (UTC)

If an en:Llyn is just another word for Lake, should we not just have Multichill tell the bot that Llyns = Lakes? --MGA73 (talk) 18:09, 7 December 2009 (UTC)
My second question aims at that. Otherwise, I can merge them later. Obviously, I prefer to see them categorized as Llyns rather than not at all. -- User:Docu at 19:37, 7 December 2009 (UTC)
This will work, but it's probably easier to add a database entry so that my bot knows that Llyn means category:Lakes. I already did this for the top categories. Should cover about 80% of the images. Feel like helping to increase this hitrate? Multichill (talk) 17:16, 8 December 2009 (UTC)
If someone wants to help they can work on User:MGA73/Sandbox Commons:Batch uploading/Geograph/Sandbox. --MGA73 (talk) 17:33, 9 December 2009 (UTC)
  • It looks like people start requesting renames for files named similar to the ones used by the bot: #1040807. -- User:Docu at 11:55, 12 December 2009 (UTC)
Yes but look at the reason. If filename is wrong then we can rename. --MGA73 (talk) 14:39, 12 December 2009 (UTC)
I don't disagree on part of the request, but the requestor also added "it contains inappropriate information about the source" and removed " - geograph.org.uk - 1040807". -- User:Docu at 14:53, 12 December 2009 (UTC)
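For point 2 of the feedback list above (the missing "." between title and description), the fix noted there as "strip(), if not last char . -> add char" could be sketched like this. The function name and the way title and comment are joined are assumptions, not the bot's actual code.

```python
def build_description(title: str, comment: str) -> str:
    """Join a Geograph title and comment, adding a full stop after the title if it lacks one."""
    title = title.strip()
    if title and not title.endswith("."):
        title += "."
    return f"{title} {comment.strip()}".strip()


print(build_description(
    "A view from East Cliff Beach across to Charmouth and Stonebarrow Hill",
    "This image is taken from the last concrete groyne",
))
```

Titles that end in "?" or "!" would still get a "." appended by this sketch, so those cases would need handling separately.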

Categorization

  1. Match location IDs to Commons categories. Almost done.
  2. Match geograph topic categories with Commons categories. Working on it, see User:MGA73/Sandbox Commons:Batch uploading/Geograph/Sandbox.
  3. Create topic by location categories. Working on it, see here for a list and here for a matrix
  4. Some geograph to Commons category matches turned out to be somewhat strange. Check and correct this list. List here

Multichill (talk) 23:52, 12 December 2009 (UTC)

For (4): in the list, there are a few matches I don't understand: why does "sea loch" match "lake" rather than "sea lochs"? "Loch" should match "lochs", not "Bodies of water".
Currently there is "Village_sign -> Category:Signs": Is there a way to create just "Category:Village signs in the United Kingdom", etc. to avoid that they go into too general categories?
To avoid problems of the "Churches in Europe" type above, maybe the matching should either work around missing continent links or we should try to run a bot to fix the categories before. I tried using CatScan2 to find such categories, but it seems to time out.
BTW I added a redirect for Bogs. It was missing despite there being "Category:Bogs by country" (we would still need to make UK/IE specific categories). Please check if this works. -- User:Docu at 03:55, 13 December 2009 (UTC)
Category:Sea lochs was missing: fixed that. -- User:Docu at 15:09, 13 December 2009 (UTC)
Matches are work in progress. A lot of them have been changed.
I don't want the categories to be too specific either so we have to find a balance.
Looks like I tackled all the Europe categories. If I missed some it's easy to fix (create link, use bot to filter the category). Multichill (talk) 16:32, 13 December 2009 (UTC)
I moved my sandbox to Commons:Batch uploading/Geograph/Sandbox. Better we work there so my own "testing" does not ruin something. --MGA73 (talk) 08:12, 14 December 2009 (UTC)
For Europe/UK, you got almost all of them: I fixed 11 missing ones: list. -- User:Docu at 15:23, 16 December 2009 (UTC)

If someone thinks we should have more categories please leave a note Commons:Batch_uploading/Geograph/cat-matrix#Sub-matrix_for_counties. --MGA73 (talk) 09:48, 17 December 2009 (UTC)

Please note that Category:Trees is a Main category and should not have files added directly to it. Please only use subcategories of it! As it is, I've just been saddled with 95 files which I'll have to recategorise now :-(( Thanks - MPF (talk) 01:59, 31 January 2010 (UTC)
See #Comments on ongoing upload. Multichill (talk) 09:23, 31 January 2010 (UTC)

Progress December 2009

This table is to keep track of the progress of the upload. All directories are located in /mnt/user-store/geograph/torrents at the toolserver.

Source dir | Destination dir | Prepared | Imported
geograph_vol001_image_0_to_49999/00 | geograph_vol001_image_0_to_49999_prepared/00 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/01 | geograph_vol001_image_0_to_49999_prepared/01 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/02 | geograph_vol001_image_0_to_49999_prepared/02 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/03 | geograph_vol001_image_0_to_49999_prepared/03 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/04 | geograph_vol001_image_0_to_49999_prepared/04 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/05 | geograph_vol002_image_50000_to_99999_prepared/05 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/06 | geograph_vol002_image_50000_to_99999_prepared/06 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/07 | geograph_vol002_image_50000_to_99999_prepared/07 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/08 | geograph_vol002_image_50000_to_99999_prepared/08 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/09 | geograph_vol002_image_50000_to_99999_prepared/09 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/10 | geograph_vol003_image_100000_to_149999_prepared/10 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/11 | geograph_vol003_image_100000_to_149999_prepared/11 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/12 | geograph_vol003_image_100000_to_149999_prepared/12 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/13 | geograph_vol003_image_100000_to_149999_prepared/13 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/14 | geograph_vol003_image_100000_to_149999_prepared/14 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/15 | geograph_vol004_image_150000_to_199999_prepared/15 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/16 | geograph_vol004_image_150000_to_199999_prepared/16 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/17 | geograph_vol004_image_150000_to_199999_prepared/17 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/18 | geograph_vol004_image_150000_to_199999_prepared/18 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/19 | geograph_vol004_image_150000_to_199999_prepared/19 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/20 | geograph_vol005_image_200000_to_249999_prepared/20 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/21 | geograph_vol005_image_200000_to_249999_prepared/21 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/22 | geograph_vol005_image_200000_to_249999_prepared/22 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/23 | geograph_vol005_image_200000_to_249999_prepared/23 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/24 | geograph_vol005_image_200000_to_249999_prepared/24 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done

Uploading

I've been told, that new space should be ready this week :-) --MGA73 (talk) 13:00, 25 January 2010 (UTC)

Yeah, new disk space is there. Batches are ready to be imported. Multichill (talk) 19:29, 27 January 2010 (UTC)
Assigned to | Progress | Bot name | Category
Multichill | Batches prepared ready to import | GeographBot | Category:Images from the Geograph British Isles project

Do you have an update on which new batches have been uploaded? This is more exciting than Christmas. --ClemRutter (talk) 18:52, 22 February 2010 (UTC)

No, not really, images 1 to 250,000 have been uploaded and now we're in the process of getting these images properly categorized. Multichill (talk) 18:44, 24 February 2010 (UTC)

Comments on ongoing upload

were added to Category:Lakes. Shouldn't they appear in some subcategory? -- User:Docu at 10:55, 30 January 2010 (UTC)

Yep, they should. Looks like some images didn't get a location based category (probably because the location tool wasn't working at the time of compiling). Will have my bot go over them to find better categories. Multichill (talk) 13:31, 30 January 2010 (UTC)
BotMultichillT is now busy finding better categories for all images which ended up in a topic category (like Category:Lakes) and not in a topic * location category (like Category:Lakes of England). Multichill (talk) 17:26, 31 January 2010 (UTC)
Recently uploaded File:Mitcham - geograph.org.uk - 107685.jpg is the same as File:Mitcham Railway Station.jpg which I uploaded over two years ago. Also - the remark has been made above - but the way the bot names and categorises images creates a lot of extra work. Ravenseft (talk) 21:00, 31 January 2010 (UTC)
Images are similar but they are not the same. One has the number 107687 and the other one 107685. --MGA73 (talk) 21:19, 31 January 2010 (UTC)
107685
107687
The images are not the same, see here on the right.
How does the naming of images cause extra work? Multichill (talk) 21:22, 31 January 2010 (UTC)
Aside from recategorising, the bot has the unfortunate habit of transforming a photographer's blurb into an image title. So one ends up with, for example, File:Old station building - geograph.org.uk - 14935.jpg which provides no clue as to what station we're talking about. Such images will all have to be renamed, as will those where the photographer has wrongly identified or spelt the station name. All extra work in an area where precision is key. Would it not be possible to prevent the bot from uploading images into the Disused Stations categories? Ravenseft (talk) 21:33, 31 January 2010 (UTC)
Another comment which could be made is that some geograph images do not merit to be uploaded into Commons - the 2nd Mitcham image above being a perfect example. Bulk uploading is simply going to make it harder to spot the wood from the trees. Ravenseft (talk) 21:36, 31 January 2010 (UTC)
Since the two images are not identical I fail to see why the second one should not be uploaded to Commons. As for the comment on not putting images in "disused stations", I wonder where the images should be placed? If the image shows a disused station it looks like a good place to put it. And finally, yes, some images might be named wrongly if they are mislabelled on Geograph. But it is hard for us to know. If mistakes are found please just add {{rename|new name.jpg}} and it will be fixed. The alternative is that someone (you) uploads 1.5 million images manually. That sounds like an impossible project. --MGA73 (talk) 22:44, 31 January 2010 (UTC)
Disused stations images should be manually uploaded to ensure that they are correctly named and categorised. This is what I and others have been doing until now and it seems to work well. I've put in some requests for renaming so let's see if they are accepted. I also wonder if the bot has picked up other images and put them in non-disused categories - I just have to hope that some other user will find the right category. Ravenseft (talk) 11:32, 1 February 2010 (UTC)
Thanks for the quick response on the lake images. Overall it seems to work quite well. -- User:Docu at 21:47, 31 January 2010 (UTC)
This was Commons’ file #6,000,000

It took a lot of work to come this far and still a lot of work to do. But I think it was worth it. Now we have a lot of images of UK etc. (also look to the right) :-) --MGA73 (talk) 08:54, 1 February 2010 (UTC)

Geographbot

To centralize discussion, we might want to redirect its talk page (User_talk:GeographBot) here. Currently it leads to User talk:Multichill -- User:Docu at 10:51, 30 January 2010 (UTC)

✓ Done. Multichill (talk) 13:31, 30 January 2010 (UTC)

Sysadmin instructions for starting/resuming batches

To start or resume a batch you need to know its batch number, which is a number between 00 and 24. Batch numbers 14 and below are suffixed with _new . The following uses batch 09 as an example.

To start or resume a batch:

$ /home/catrope/bin/importBatch 09_new

Be sure to run this in a screen session on hume.

The importBatch script reports to #geographbot on IRC when it starts and stops. Sometimes, the import script will die with a database error and a batch will be reported as having completed even though that's not really the case. To check whether a batch has run to completion:

$ ls -r /home/catrope/upload/09_new | head -n 2
Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.txt
Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.jpg

Verify that Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.jpg exists on Commons.
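The completion check above could be partly automated along these lines. This is a rough sketch under stated assumptions: the helper names are made up, Python's sort order may differ from what `ls -r` gives under the server's locale, and the code only builds a standard MediaWiki `action=query` URL rather than fetching it. In the JSON the API returns, a page that does not exist is marked as missing.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def last_image_title(filenames: Iterable[str]) -> Optional[str]:
    """Return the Commons title of the alphabetically last .jpg in a batch directory listing."""
    jpgs = sorted(name for name in filenames if name.lower().endswith(".jpg"))
    if not jpgs:
        return None
    # File names on disk use underscores where the Commons title has spaces.
    return "File:" + jpgs[-1].replace("_", " ")


def existence_query_url(title: str) -> str:
    """Build a MediaWiki API query URL for checking whether the title exists on Commons."""
    params = {"action": "query", "titles": title, "format": "json"}
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)


listing = [
    "Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.txt",
    "Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.jpg",
]
title = last_image_title(listing)
print(title)
print(existence_query_url(title))
```

In practice `listing` would come from something like `os.listdir("/home/catrope/upload/09_new")`.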

Category:Fords

Fords are not the Ford Motor Company. Lukas 3z (talk) 15:51, 1 February 2010 (UTC)

Uploading over existing redirects?

File:Mill Lane Oversley Green - geograph.org.uk - 129468.jpg was a redirect before it got bulk-uploaded. Now a new image has been uploaded, but the page data still just has a redirect in it. I suspect this may have happened elsewhere too. --bjh21 (talk) 16:47, 1 February 2010 (UTC)

  • Looks like someone forgot to delete a misleading redirect. -- User:Docu at 16:51, 1 February 2010 (UTC)

Geograph links

Please see this edit. Surely it would not be too difficult to teach the bot how to convert Geograph's internal links properly? — RHaworth (Talk | contribs) 19:38, 1 February 2010 (UTC)

Should it not link to the matching file on Commons (File:Longford Pump - geograph.org.uk - 136930.jpg) then? --MGA73 (talk) 22:06, 1 February 2010 (UTC)

To some extent, depends how long the upload process is going to take. To do it thoroughly, one would check first if the upload here had been done. Failing that, they can all be created as links to Geograph and a bot can come round and convert them to Commons links as the images get uploaded. — RHaworth (Talk | contribs) 01:12, 8 February 2010 (UTC)

Category:Darlington

The bot identified images of Dartington as being of Darlington, and mis-categorised these images as a result (example). Not sure why.--Nilfanion (talk) 11:27, 2 February 2010 (UTC)

See the "this tool" link in the big yellow template for the source of this problem. Multichill (talk) 08:52, 3 February 2010 (UTC)

Category:Lakes of Cumbria

Commons' file #6,000,000.
Category:Ullswater has 76 files (up from 7 files)

Thanks to this upload, the category now has a much larger selection of images. A month ago, there were only 7 or 8 subcategories.

Similarly Category:Ullswater, home to image #6,000,000, is now at 76 files from 7 files a week ago. Not all of the additional ones are new though, but the existing ones were categorized (manually) with the new ones. -- User:Docu at 07:08, 3 February 2010 (UTC)

That's great! It's nice to use sumitup on the new categories, see for example Category:Angle Tarn (Langstrath). See User:Multichill/monobook.js for an easy link (should probably be a gadget btw). I have a simple bot to create categories based on Wikipedia articles, this also includes information from sum-it-up (used it to create all the village categories). If other people find this useful I can publish it somewhere. Multichill (talk) 08:21, 3 February 2010 (UTC)
Great work indeed. Sum-it-up is very nice, but would be even better if, for long descriptions, it encapsulated them in {{Mld}} or if that's too complicated, in {{Collapsed}}. --Foroa (talk) 09:42, 3 February 2010 (UTC)
Already asked Magnus, but he doesn't seem to be very responsive. Multichill (talk) 10:05, 3 February 2010 (UTC)
Maybe the five-year warranty expired and WikiMedia didn't pay for its extension ;) If you send him a patch, he will probably apply it.
I agree that descriptions with sumitup would be nice to have, but I wouldn't mind a few tweaks to its layout either (e.g.). Ideally, an empty category page would fill itself directly with the text.
Additionally, for that to work here, I will have to write a series of articles first. Usually it's the other way round: there is a set of articles about lakes and they need images. :-) Obviously, once they exist, maybe a bot can create the category descriptions. -- User:Docu at 16:53, 3 February 2010 (UTC)

Cats to clean up

I made Commons:Batch uploading/Geograph/cats to clean up. Big categories should (and will) first be checked by a bot; then it would be nice if users could help out. Multichill (talk) 22:10, 3 February 2010 (UTC)

This seems to be dumping a lot of images into Category:Buildings. Normally, I try to keep that category under control by finding better places to put virtually all images that are dropped there, but this is far more than I will take on. I hope someone plans to move these to some more geographically specific categories. - Jmabel ! talk 01:24, 5 February 2010 (UTC)
Hi Jmabel, did you see the part about categories first need to be checked by a bot? The bot reduced the number of items in Category:Buildings to a more acceptable level. Multichill (talk) 09:17, 5 February 2010 (UTC)
Can I just point out that the subject categories are not the only ones with severe problems. The high-level geographic categories have also been completely overwhelmed. (Five of the worst examples are Category:England, Category:Scotland, Category:North Yorkshire, Category:Devon and Category:Aberdeenshire, which currently contain 2276, 1612, 1151, 1060 and 1041 files respectively.) Providing more precise location details is likely to be difficult for a bot, as if a more exact location was available in the Geonames database then Geographbot would have been able to use that on initial upload.
Jmabel has hit my concern here, many of these categories are not maintained by large numbers of users so the addition of over 1000 images in a few days will take weeks to clean up.
The obvious bot fix to this problem is to make a bunch of category intersects (or use the existing ones), so instead of Category:Beaches and Category:Cornwall it would be Category:Beaches of Cornwall (The in/of issue needs addressing but that's not directly relevant to this). The intersection of major subject and county level information should work for the majority of images for the time being, and will prevent both the subject and location categories from being flooded. Ideally this should be done at upload and not have another (set of) bots running around afterwards. Could modifications be made to GeographBot before it starts on the next batch?--Nilfanion (talk) 03:11, 5 February 2010 (UTC)
As you might have noticed we created a lot of intersection categories prior to this upload. If you create intersect categories these will be used in future uploads (no need for modifications). To intersect current categories I created {{Intersect categories}} a while ago. I added it to Category:Beaches of Cornwall. I'm not sure if it will catch a lot of images, let's see.
As a side note. At User:Multichill/Location based categorization I put an idea about location based categorization and how it could be used and improved. Multichill (talk) 09:17, 5 February 2010 (UTC)
I agree that it would be better if the right categories were added during upload. For that to happen we need 1) that the information is on the image on Geograph so the bot will know where to put it, 2) all relevant categories should be created and categorized correctly, 3) the bot should be adjusted correctly.
We tried to make the most relevant categories (Commons:Batch uploading/Geograph/cat-matrix) but for some reasons some images did not end up in the right categories. One of the problems is "bad names" for example top category is Category:Water wells but subcategories are named Category:Wells in England. Bot does not like that for some reason.
Right now we are trying to fix the problems with help from bots. So if you find categories with way too many images in them, then put a note here and we will look at it. If it is just a handful of images please wait or do it manually. Any help is welcome. --MGA73 (talk) 09:26, 5 February 2010 (UTC)
I'd point out that I've already made ~500 manual edits relating to this (mostly from Category:Devon), though I'm doing fundamental work on improving that category and its tree as opposed to just the simpler upload fix I suggested. There are some problems I'd like to see addressed:
  1. The cat-matrix didn't work perfectly, the water wells being a prime example. Another problem seems to be examples that got placed in * in England categories, when * in county already existed and was in the matrix. File:Bull Field - geograph.org.uk - 164427.jpg should have been placed in Category:Fields in Somerset but was instead put in Category:Fields in England, when the bot also correctly identified the appropriate village in Somerset.
  2. The need for hundreds of county-level intersects. It's hard to identify just which ones are required, but I'd suggest that if any subject category or one of the subject in England/Scotland/Wales/N Ireland (or UK) categories has more than 100 Geograph images at present then ensure that it has full county-level breakdown, and that the bot will use those. Even those county-level subject categories are going to become unwieldy after the complete upload (I'd estimate Category:Fields in Somerset is liable to have well over 500 images), but that is a reasonably precise category. Could you provide a full breakdown of the categories used by the upload bot?
  3. When the location tool was not working for whatever reason, the bot went ahead and just added to subject-only categories. If this occurs again will the bot do the same thing again or will it notice this and not upload (and try the tool again)?
Oh and as for categories with way too many files: Many of the subcategories of Category:Counties of England, Category:Counties of Scotland, Category:Counties of Wales and Category:Counties of Northern Ireland have in excess of 100 files, and Category:County Tyrone is exceptionally bad with 2,600+ files. If any category has so many files in it that only subcategories up to F are displayed I'd say its usability has been seriously affected and needs major work to fix.--Nilfanion (talk) 12:45, 5 February 2010 (UTC)
Such regionally intersected categories can help improve categorization by breaking it down to a reasonable size, but we shouldn't forget to find additional categories besides these basic ones. Many topics with "by country" structures also have a detailed "by sub-topic" categorization that can get neglected. -- User:Docu at 13:00, 5 February 2010 (UTC)

Oh, and a suggestion: when uploads start again please do a relatively limited test batch (~5-10,000?) to see if there are still further problems with categorisation or if it's sorted. A bot shouldn't be making any errors like the Somerset field example I just gave; the fact that it did is indicative of a problem. I like the idea of having many of these files (some are junk, but lots are useful), but if it's making the UK location categories unusable that is serious harm.--Nilfanion (talk) 12:50, 5 February 2010 (UTC)

Agreeing with Nilfanion here. The Devon cat which I have worked on since I arrived here three years ago went from quite well organised to a bomb zone in no time at all. There is a stack of work to do and very few people dealing with this aspect. Care and thought before further dumps please --Herby talk thyme 13:19, 5 February 2010 (UTC)
Category:Gwynedd is an utter disaster. I've been working on the Category:Conwy County Borough and wondering why so few files are present. Now I know the answer - a large number of them are categorised in Gwynedd! The same is true for many files that should be in Category:Anglesey. The Gwynedd cat now has nearly 4,000 unsorted files, many of which are for places not even in that county; I can't even begin thinking about sorting that lot out. Something needs to be done. (PS GeographBot seems to live in the past: the old Gwynedd included these areas but the present county does not.) Anatiomaros (talk) 18:35, 6 February 2010 (UTC)
This seems to be a wide problem. For example (though on a much smaller scale), Category:Horsell contains (or contained) images from Woking, Mayford and Kingfield. These are different places, and have their own cats. How and why does this happen? In particular, how does an image titled "Mayford Roundabout" end up in Category:Horsell, several miles away? Arriva436talk/contribs 18:56, 6 February 2010 (UTC)
Gwynedd seems to be a separate problem that needs to be looked into. If one wants to look at images other than those from Geograph in the meantime, CatScan can help
http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php?wikifam=commons.wikimedia.org&basecat=Gwynedd&basedeep=1&mode=ts&templates=Geograph&untagged=on&go=Scan&format=html&userlang=en
Things like "Mayford Roundabout" probably happen when coordinates of the image are closer to Horsell than of any other place. -- User:Docu at 19:10, 6 February 2010 (UTC)
There are a couple of limitations with that suggestion: 1) most users wouldn't know how to generate that CatScan and 2) sub-categories like these are hard to reach. That's not to mention the additional strain on the toolserver all the queries may generate. I've got two ideas here: 1) remove all unreviewed files from the county categories and place them in a special sub-cat; 2) add a header template to the badly affected categories explaining the situation and giving clear links to CatScan and the 2nd half of the category list.
With respect to the Mayford Roundabout, the bot ignores the filename and goes with what the geonames database returns. This causes an additional problem for commonly used place names (I've had to remove a bunch of images from Category:Luton as they were associated with Luton, Devon, not Luton, Beds).--Nilfanion (talk) 19:30, 6 February 2010 (UTC)
OK, that makes sense. But how many other images that have been uploaded could be in the wrong category, in a similar manner to Mayford Roundabout? Also, when they are checked, is it likely for the checker to miss the fact it's in the wrong category?
Another issue (going slightly off topic) is how bad some of the geocoding is. At Commons we aim for accuracy of 7 meters - in some cases the Geograph images are miles out. Going back to Mayford, there's a large number of Mayford images that were taken in different places, but all show up as being taken in the same place in the middle of a field. Arriva436talk/contribs 20:04, 6 February 2010 (UTC)
The inaccuracy of the geocoding is unfortunately inevitable to a point. Geograph's goal is to photograph every grid square, and most images uploaded there are placed no more precisely than that. When we translate that to here, those images get placed at the centre of the grid square; in the case of Mayford, the centre of SU9956 is in some fields. There is no way to obtain more precise information, unless you can place them more accurately yourself using whatever your favourite geocoding tool is.--Nilfanion (talk) 20:23, 6 February 2010 (UTC)
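To make the "centre of the grid square" point concrete, here is a minimal Python sketch that turns a British National Grid reference such as SU9956 into the easting/northing of the centre of the square it names. The function name is hypothetical, the sketch handles only the two-letter Great Britain grid (not the Irish grid that Geograph also covers), and it is not taken from GeographBot.

```python
def grid_square_centre(gridref):
    """Return (easting, northing) in metres of the centre of a GB grid square.

    The size of the square follows from the number of digits:
    'SU9956' names a 1 km square, 'SU 39500 13500' a 1 m square.
    """
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if (len(letters) != 2 or not letters.isalpha() or "I" in letters
            or len(digits) % 2 or (digits and not digits.isdigit())):
        raise ValueError("not a valid GB grid reference: %r" % gridref)

    # Letter indices, skipping 'I', which the grid lettering does not use.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Position of the 100 km square, counted in 100 km units from the grid origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    half = len(digits) // 2
    size = 10 ** (5 - half)  # side length of the named square in metres
    east = e100km * 100000 + (int(digits[:half]) if half else 0) * size
    north = n100km * 100000 + (int(digits[half:]) if half else 0) * size

    # Shift from the south-west corner to the centre of the square.
    return east + size // 2, north + size // 2
```

With a four-figure reference every photo in the square gets the same point, which is why many Mayford images appear stacked on one spot in a field.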
To avoid that people get overwhelmed by geograph images in categories, I added a sortkey to Template:Check categories-Geograph. This should make these images appear after the already checked ones. -- User:Docu at 13:57, 16 February 2010 (UTC)

Linking

Seemingly as part of the Geograph batch upload, categories have been created, based on the en.Wikipedia article. These appear to take the first part of the lead of the relevant article, and convert the links to cross-wiki ones. However, as on the original en.Wiki page (articles don't link to themselves), the Commons categories don't include the most important link - i.e. the link to the article the category is for. An example is Category:Burpham, Surrey, which links to everything except en:Burpham, Surrey. The difference can be seen by looking at Category:Merrow, Surrey, where I have fixed the issue. I know it does have the link in the toolbar down the side, but this isn't very obvious, especially when the description is covered in other links.

Another funny thing I've noticed is that two images taken in Burpham, and with Burpham in the file name, were in Category:Merrow, Surrey. Is there an explanation for this (like the bot takes the general area Geograph says, which in this case would be Merrow)? Arriva436talk/contribs 21:00, 4 February 2010 (UTC)

Top categories

At Commons:Batch uploading/Geograph/top categories I put a list of the most populated categories. When most of the bot work is done at Commons:Batch uploading/Geograph/cats to clean up, I'll shift my bot's attention to these categories. Multichill (talk) 13:19, 6 February 2010 (UTC)

For roads, farms and forests, a regional break-down would probably help. I guess it's unlikely that this ends up in Category:Loch Ness? If yes, I will try to work on that. -- User:Docu at 13:54, 6 February 2010 (UTC)
One potential concern looking at those lists are ones like Category:Roads in Devon, where the subject/location intersect is already at county level but there are still nearly 1,000 images in that category. I'm not as concerned by that as I am by county or subject categories with too many files, but I'm less sure of how to tackle it. I can see two ways to split that:
  1. By subject: The Roads in * are in particular awkward as I can't see that many natural subtypes. We can split out the classified roads, but that doesn't address the large number of images of unclassified roads. We could also split out signs, roundabouts, junctions etc.
  2. By location: Could always take it down to the next tier - which would be districts (so have Category:Roads in West Devon etc).
I'm cautious as to having subject-by-location categories with the location being that precise. I'd prefer greater precision in the subject, so I guess take the first option as far as it can go and see how bad the situation is then.--Nilfanion (talk) 17:47, 6 February 2010 (UTC)

Source

Two suggestions for the "Source" item: a) I hate to see links from the general to the specific and b) given that Geograph is all about grid refs, it seems a pity that grid ref info should be lost altogether. I suggest that instead of:

From geograph.org.uk

we should have:

from this image at geograph.org.uk for grid square SK0754

RHaworth (Talk | contribs) 02:28, 8 February 2010 (UTC)

Bad categories at upload

Obviously the bot is working within the limitations of the database and will make errors (be best if we could alter the db...). However, I've seen a number of problematic categories where the name relates to multiple settlements or completely different topics. In one specific instance, BotMultichillT readded an inappropriate location after I had already removed it once (an accurate location category has been provided).

List of ones I have found so far: Category:Luton, Category:Dartington, Category:Saint Agnes and Category:Bradford. As for how to handle this, perhaps have the bot check that the image is approximately right for that location. If the location is more than 50 km from Luton, Bedfordshire then it probably doesn't belong in Category:Luton. That's a complication, but it should only matter for a relatively small number of categories - could we have a specific check for them?--Nilfanion (talk) 00:52, 16 February 2010 (UTC) Bound to be more, c
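A minimal Python sketch of the distance check suggested above, assuming a small hand-maintained table of reference points for ambiguous category names. The table, the function names and the coordinates (rough, illustrative values) are all assumptions, not part of the bot.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (haversine formula, spherical Earth)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Hypothetical reference points for ambiguous names; coordinates are approximate.
REFERENCE_POINTS = {
    "Luton": (51.88, -0.42),  # Luton, Bedfordshire
}


def plausible_for_category(category, lat, lon, max_km=50.0):
    """Return False if the image is too far from the place the category is about."""
    reference = REFERENCE_POINTS.get(category)
    if reference is None:
        return True  # no special check configured for this category
    return distance_km(lat, lon, reference[0], reference[1]) <= max_km
```

Only categories listed in the table are checked, which matches the idea of a specific check for a relatively small number of problem names.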

File name

Is there any need to put ' - geograph.org.uk -' in the filename? We do not insist on it when people do pure hand uploads, so why do we need it on bot or tool uploads? — RHaworth (Talk | contribs) 01:28, 22 February 2010 (UTC)

Yes and no. If someone uploads an image without it, there is no reason to move the image just to get it into the file name. The same goes the other way around. Normally we let the uploader decide the name. Personally I think it is a good way to identify where the image came from, and it looks like Multichill thinks the same. --MGA73 (talk) 09:31, 22 February 2010 (UTC)
In general, I think it's a good idea to identify such sources in the filename, even when doing manual uploads. Personally, I don't think the chosen way is ideal, but just "Geograph" wouldn't have been of much use. -- User:Docu at 17:25, 24 February 2010 (UTC)

Image corruption

These ten Geograph images produced corrupt results here. Since it does not appear to be the bot's fault, I have reported it at the Village pump. — RHaworth (Talk | contribs) 01:28, 22 February 2010 (UTC)

I tried to save the image and upload it manually. I got an error (Files of the MIME type "application/x-php" are not allowed to be uploaded.) so it looks like it is not the bot. The easy way to fix it is to download the files, fix the problem and click "Upload a new version of this file" at the bottom of the corrupted file pages. --MGA73 (talk) 09:37, 22 February 2010 (UTC)
These images should be fixed now. But it would be nice if we could find all images like that. --MGA73 (talk) 10:02, 22 February 2010 (UTC)

Confirmed to be a known bug. No need to wait for the bug to be fixed - just edit the images as you have done. Yes, for the stuff done by the bot, we can find all corrupt images. They will be in this list of Geograph images where I have not found the corresponding image here. The list started with about 210 items. I have whittled it down and will continue to do so. But if anyone else wants to join in … The object is to find the image and add a valid {{geograph}} tag to it. — RHaworth (Talk | contribs) 17:08, 24 February 2010 (UTC)

ONO it's OLU

The bot has done a very thorough job - with one curious exception: it has refused to copy any of these 800 images uploaded by Geograph user 14997, OLU. The few images that have been uploaded have all been done by people. Why?? User 9685, Wilson Adams seems also to have been ignored. — RHaworth (Talk | contribs) 17:08, 24 February 2010 (UTC)

Geotags

This area seems to be a bit shaky. By chance I happened over to Durham, to upload some of my own images. Yes I had to play a bit with categories as I expected- Durham the city, in County Durham the county. Yes all the geograph images were geotagged but

  • a few were accurate.
  • a few were 112 m from where the photographer stood.
  • a few tagged the object accurately
  • a few tagged the object but missed by 112 m
  • a few tagged the grid square- but the +- 500m precision made it difficult to see if the intended point was 112m out.

It looks as if the bot is cascading down from WGS84 to OSGB to grid square. This is fine, but if we could record in the location tag which of these it was, a further bot could then do the Helmert conversion to wipe out the discrepancy. I have put a comment on Commons talk:Geocoding asking for ideas.--ClemRutter (talk) 18:25, 11 March 2010 (UTC)
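For readers unfamiliar with the Helmert conversion mentioned above, here is a minimal Python sketch of an OSGB36-to-WGS84 datum shift for a latitude/longitude pair. It assumes the input is already latitude/longitude on the OSGB36 datum (converting a grid reference to latitude/longitude is a separate step not shown), and the ellipsoid and transformation constants are the commonly published Ordnance Survey values quoted from memory, so they should be checked against OS documentation before any real use. The function names are hypothetical.

```python
import math

# Ellipsoids as (semi-major axis, semi-minor axis) in metres. Assumed values.
AIRY_1830 = (6377563.396, 6356256.909)
WGS84 = (6378137.0, 6356752.314245)

# Assumed OSGB36 -> WGS84 Helmert parameters:
# translations in metres, scale in parts per million, rotations in arc-seconds.
TX, TY, TZ = 446.448, -125.157, 542.060
S_PPM = -20.4894
RX, RY, RZ = 0.1502, 0.2470, 0.8421


def to_cartesian(lat, lon, ellipsoid, height=0.0):
    """Geodetic latitude/longitude (degrees) to Earth-centred x, y, z (metres)."""
    a, b = ellipsoid
    e2 = 1 - (b * b) / (a * a)
    phi, lam = math.radians(lat), math.radians(lon)
    nu = a / math.sqrt(1 - e2 * math.sin(phi) ** 2)
    return ((nu + height) * math.cos(phi) * math.cos(lam),
            (nu + height) * math.cos(phi) * math.sin(lam),
            ((1 - e2) * nu + height) * math.sin(phi))


def to_geodetic(x, y, z, ellipsoid):
    """Earth-centred x, y, z (metres) to latitude/longitude (degrees), by iteration."""
    a, b = ellipsoid
    e2 = 1 - (b * b) / (a * a)
    p = math.hypot(x, y)
    phi = math.atan2(z, p * (1 - e2))
    for _ in range(10):
        nu = a / math.sqrt(1 - e2 * math.sin(phi) ** 2)
        phi = math.atan2(z + e2 * nu * math.sin(phi), p)
    return math.degrees(phi), math.degrees(math.atan2(y, x))


def osgb36_to_wgs84(lat, lon):
    """Apply the 7-parameter Helmert transformation to an OSGB36 lat/lon pair."""
    x, y, z = to_cartesian(lat, lon, AIRY_1830)
    s = 1 + S_PPM * 1e-6
    rx, ry, rz = (math.radians(r / 3600) for r in (RX, RY, RZ))
    x2 = TX + s * x - rz * y + ry * z
    y2 = TY + rz * x + s * y - rx * z
    z2 = TZ - ry * x + rx * y + s * z
    return to_geodetic(x2, y2, z2, WGS84)
```

Skipping this step, that is, treating OSGB36 latitude/longitude as if it were WGS84, moves a point in Great Britain by something on the order of a hundred metres, which is consistent with the roughly 112 m offsets described in this section.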


Status

Assigned to | Progress | Bot name
Multichill | Did first 250,000, still 1 million files left to do | GeographBot

The upload is complete and only categorization is now going on. Or do we expect more images?--Diaa abdelmoneim (talk) 12:54, 26 March 2010 (UTC)

I did the first 250,000, still 1 million files left to do. Multichill (talk) 17:43, 26 March 2010 (UTC)
Any idea on timescales on that? For what it's worth I'm only a fraction of the way through sorting the categories from the first upload in the one county that I'm concentrating on...--Nilfanion (talk) 21:40, 26 March 2010 (UTC)
Not sure. Depends on the torrents being available. Multichill (talk) 09:46, 2 April 2010 (UTC)

Automatic categorization update

I modified the bot to use the OpenStreetMap tool (the one included in {{Check categories-Geograph}}). Looks like I'm now able to find better categories for the overcrowded categories. Commons:Batch_uploading/Geograph/cats to clean up and Commons:Batch uploading/Geograph/top categories will be rescanned. Multichill (talk) 19:30, 5 April 2010 (UTC)

Good news. It did seem to me that the link looked better than the result. Can it help fine tune some of the localities too? -- User:Docu at 19:43, 5 April 2010 (UTC)
Yes, and please be aware that the old "preserved county" of Gwynedd which the first batch uploads used is about twice the size of the present-day county and included all of Anglesey, the western half of Conwy County Borough and a small part of Denbighshire. Sorting that lot out is still not completed, although I think I've managed to "reclaim" most of the Conwy files : please, please be very careful that Geographbot doesn't undo all my work (many hundreds of files!) and that the same thing doesn't happen again. Also I'd agree that sometimes the localities are not always perfect but that's a relatively minor problem if the files at least end up in the right county category. Thanks, Anatiomaros (talk) 21:57, 5 April 2010 (UTC)
I just did Category:Churches in England. This reduced the number of images from 1900 to 400. I'm using a different source (OSM) now. I hope it doesn't use old information, so that files end up in the right county. I just fired up Category:Roads in Scotland so keep an eye on it. Multichill (talk) 08:25, 6 April 2010 (UTC)
Would it be possible to have the bot stop putting images in Category:Cleveland, Ohio? - EurekaLott (talk) 14:58, 7 April 2010 (UTC)
Either nuke Category:Cleveland or create Category:Cleveland, England. Multichill (talk) 16:57, 7 April 2010 (UTC)

Anglesey, Gwynedd and Palestine

Gwynedd seems to have produced not a few problems for Geographbot. Bethesda is not the same as Bethesda: several dozen files for the Gwynedd town ended up in Palestine. The Bangor area seems to confuse the bot. Most of the files for the cathedral city of Bangor ended up across the Menai Straits in Category:Menai Bridge (in Anglesey). Conversely, a lot of files for the Bangor area ended up in Anglesey, including a file named 'Bangor Cathedral' which was put in the category for Llandegfan - a tiny village in Anglesey - and in the main Gwynedd cat and that despite the fact that it was correctly placed in Category:Bangor Cathedral, which is the only category it needed. Seems a bit bizarre! Quibbles apart, I think we should move on to the next batch(es) - better to have the remaining files now and get on with it, perhaps? Any news on the torrent availability yet? Anatiomaros (talk) 22:35, 2 May 2010 (UTC)

Use this trick to clean up disambiguation problems like this one. I proposed a move of Category:Bethesda.
I haven't heard anything about new torrents yet. I'll contact them again in the next couple of weeks. Multichill (talk) 11:32, 3 May 2010 (UTC)
Thanks, that seems to work. By the way, I've been working - slowly! - on the files cluttering up Category:Wales and one thing I've noticed is that most of them were placed in deeper cats as well, e.g. this one was placed in *Farms in Wales and also in the parent Wales cat, which is completely unnecessary and creates a great deal of work. I hope there is some way of fixing that for the next upload, as the bot should avoid placing files in higher/parent categories except as a last resort, i.e. if it's not able to find a deeper category. Anatiomaros (talk) 18:30, 4 May 2010 (UTC)
An even more bizarre example is this file, correctly placed in *Carmarthenshire but also in both the *Wales and *United Kingdom categories... Anatiomaros (talk) 18:48, 4 May 2010 (UTC)
See the logic of the bot. If the overcategorization filter is acting up, you'll get these kind of results. Multichill (talk) 21:59, 4 May 2010 (UTC)
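As an illustration of what an overcategorization filter is meant to do (drop a category when the file is already in one of its descendants), here is a minimal Python sketch. The small parent table and the function names are hypothetical; the bot's real logic is the one linked above and may differ.

```python
# Hypothetical fragment of the category tree: child -> direct parent categories.
PARENTS = {
    "Farms in Wales": {"Wales", "Farms in the United Kingdom"},
    "Carmarthenshire": {"Wales"},
    "Wales": {"United Kingdom"},
}


def ancestors(category, parents=PARENTS):
    """Return every category reachable by walking up from the given category."""
    seen = set()
    stack = [category]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_overcategorization(categories, parents=PARENTS):
    """Remove categories that are ancestors of another category on the same file."""
    cats = set(categories)
    redundant = set()
    for cat in cats:
        redundant |= ancestors(cat, parents) & cats
    return cats - redundant
```

With this table, a file in Carmarthenshire, Wales and United Kingdom is reduced to Carmarthenshire alone, while a file that only has Wales keeps it as the last resort.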
I added a static mapping from geonameId 2655804 to category:Bethesda,_Gwynedd. Multichill (talk) 10:27, 30 May 2010 (UTC)

Broken Geotags

And several others all tagged to the same incorrect location. There are several other bot mistagged geograph images in Southampton. I do not know how widespread this is, but it needs to be addressed. 188.222.170.156 17:50, 2 May 2010 (UTC)

The locations are probably just lacking precision. Geograph works by a number of squares on a grid. You can fix this by adjusting the coordinates. -- User:Docu at 18:04, 2 May 2010 (UTC)
Yes and no. Looking at the cigarette factory, the bot has captured the user coordinate, which is measured in OSGB36, then displayed it on the map, which uses WGS84. No Helmert conversion has been done, so we expect a 112 m inaccuracy. Why the others are there is open to question.--ClemRutter (talk) 09:23, 3 May 2010 (UTC)
It would be interesting to know what's in the input used for the upload. -- User:Docu at 10:25, 3 May 2010 (UTC)
See top: these dumps of the database. Multichill (talk) 11:33, 3 May 2010 (UTC)
There are several other incorrectly marked locations in Southampton, some like the above, are all marked to one incorrect location. Here's an example of one on its own though, File:Redbridge Flyover, Southampton - geograph.org.uk - 28728.jpg, bot copied from here. The source states a location of 50°55.2873N 1°28.0549W (translates to DMS 50:55:17.238N 1:28:3.294W), yet the Commons upload translates this to 50:55:11.09N 1:28:4.39W. There needs to be a manual sweep of geograph bot uploads, because right now, it's poisoning the tools that use this information such as map layers. Suitcivil (talk) 21:09, 4 May 2010 (UTC)
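The arithmetic behind the comparison above (degrees and decimal minutes on the Geograph page versus degrees:minutes:seconds on Commons) can be sketched in Python as follows; the function names are illustrative only.

```python
def decimal_minutes_to_dms(degrees, minutes):
    """Convert degrees + decimal minutes to (degrees, minutes, seconds)."""
    whole_minutes = int(minutes)
    seconds = (minutes - whole_minutes) * 60
    return degrees, whole_minutes, round(seconds, 3)


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value
```

Applied to the Redbridge Flyover figures, 50°55.2873N becomes 50:55:17.238 and 1°28.0549W becomes 1:28:3.294, as quoted. The Commons value of 50:55:11.09N therefore differs by about 6 arc-seconds of latitude, which is roughly 190 m on the ground.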
Just spotted that something very similar is mentioned above at #Geotags. Suitcivil (talk) 21:16, 4 May 2010 (UTC)
At least in the case of 28728 (Redbridge Flyover) I believe that is our (Geograph's) fault. Multichill's GeographBot just copies the lat/long from the dump files mentioned above. Inspired by this thread I went digging, and it seems that the Geograph code doesn't update that lat/long when an image is relocated within a gridsquare. So the lat/long for that image is the old location, from before it was updated via the Geograph site. We never noticed because it's only cached there for the search engine; the actual photo page calculates it live from the easting/northing. The best I can offer is to create a bot to check that the lat/long columns (as used in the dump) are correct and update them as necessary, and then create a new dump. I will also have the script output a changelog, so it can be used by another bot to correct the coordinates within Wikimedia. I have no way of estimating how many images will be affected by this yet. (Oh, and of course I will fix the bug within Geograph!) Sorry all for the confusion! BarryHunter (talk) 23:30, 4 May 2010 (UTC)
I just checked the four images mentioned in the opening to this section, and yes all have been moved within the gridsquare, so will have inaccurate lat/long in the version uploaded by GeographBot BarryHunter (talk) 23:34, 4 May 2010 (UTC)
Look, we have a million and a quarter excellent images- and potentially a million and a quarter inaccurate geotags. The errors are multiple-File:Church Street, Shirley, Southampton - geograph.org.uk - 26621.jpg back converts to SU 39500 13500 giving 100m accuracy (or a hell of a coincidence) but going back to geograph we see that is the subject location. In this case it is also noted as the photographers location- but the previous one I checked the photographer was at a different location! Here, from the description, the photographer was at //location dec|50.9223|-1.4333// or SU 39929 13802, so again different.
Could I propose that we upload all the images as soon as possible but with a new template {{geoglocation}}? This would include all the geo information from Geograph, and all the parameters of {{location}}. A bot could then be written that carefully verifies the input fields and, when satisfied, copies the correct photographer information, accurately converted from OSGB36 alphanumeric into WGS84, into {{location}}, or adds a textual comment that the object location was XXXX XXXX together with the degree of precision. This would also allow these images to be put in a to-be-checked category. We can't do this at the moment by bot or manually without referring back to the Geograph page, as not all the relevant fields have been copied across.
I do propose a two-part solution, as so many images are erroneous that one no longer has confidence in any one of them being correct, and the UK map is being saturated with errors. Once we have got all the data on hand we can discuss the best algorithm to use in order to correctly geotag them. Initially I would be happy if we could generate our own photographer's lat/long by running the Helmert transformation on the photographer's location grid reference. I have the code here in js [1].
Just as a summary, here are the errors I have found
  • Sloppy tagging. The Geograph user just typed in the wrong reference.
  • Geograph works to 100m precision- we attempt at least 7m precision
  • Geograph tags the object not the photographer
  • Geograph gives the object location in OSGB based alphanumeric grid references and WGS84: we use WGS84
  • Geograph gives the photographer location in OSGB based alphanumeric grid references only: we use WGS84
  • Geograph fills an empty photographer location field by copying in the object location
The good news is that Geograph and I get the same result when we run a Helmert conversion (OSGB36 to WGS84).
The cumulative effect of all these errors is impossible to quantify, but runs from 112m ± 100m through to several kilometres.--ClemRutter (talk) 14:31, 5 May 2010 (UTC)
The gridimage_geo database table, from the geograph dumps (already loaded by Multichill onto the Toolserver), contains the full eastings/northings for photographer location as accurate as geograph has it, many are 10m precision. This is more reliable than the wgs84_lat/long columns already used by GeographBot - due to the bug mentioned above. BarryHunter (talk) 17:08, 5 May 2010 (UTC)
I've corrected the column on Geograph, and will recreate the actual dump files shortly. However, I have put the changelog here: http://data.geograph.org.uk/tmp_fix_log.mysql.gz - which someone can use to create a bot to correct the coordinates. I've put in the old coords too, so the bot can replace them only if somebody hasn't already updated them on Wikimedia. I'd do it myself, but having never created such a bot, I would probably just wreak more havoc! 776 images below id 250000 were corrected, so that's a ballpark figure for the number of images affected on Wikimedia. (And yes, this is still our subject location.) BarryHunter (talk) 22:55, 5 May 2010 (UTC)
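The "replace only if nobody has already updated them" rule could look something like the following Python sketch. The layout of the changelog is not described here, so the (old, new) coordinate pairs, the tolerance and the function name are assumptions made purely for illustration.

```python
def corrected_coordinates(current, old, new, tolerance=1e-5):
    """Return the coordinates a page should carry after applying one changelog entry.

    current: (lat, lon) now on the Commons page
    old:     (lat, lon) originally supplied by the dump (the buggy value)
    new:     (lat, lon) from the corrected dump
    The fix is applied only if the page still carries the old value,
    so manual corrections made in the meantime are left alone.
    """
    unchanged = (abs(current[0] - old[0]) <= tolerance
                 and abs(current[1] - old[1]) <= tolerance)
    return new if unchanged else current
```

A page whose coordinates were moved by hand since the upload no longer matches the old value, so the sketch leaves it as it is.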
One comment which is related here. Regarding subject / photo locations: We (Commons) care about the photo location for geolocation purposes, but the subject location for effective categorisation... (So if different, don't throw out the subject location data).--Nilfanion (talk) 23:03, 5 May 2010 (UTC)
I'll put the temp table on the toolserver too and use it to correct the incorrect geotags here. Don't know when. Multichill (talk) 13:19, 22 May 2010 (UTC)
I have just discovered that commons has a {{object location}} tag- that may be useful. I really don't know any more, and I have never seen it in use. --ClemRutter (talk) 22:12, 22 May 2010 (UTC)
Ok, fixing geotags now. Multichill (talk) 15:27, 14 August 2010 (UTC)

Update May 2010

Hi guys, it's time for an update!

  • I spent some time on categorization. I'm now using the OSM source and the results seem to be much better. I'm thinking about combining both sources, but I'm not sure about that yet (Gwynedd problem would happen again).
  • No news on new batches. No new torrents available yet and no word from the Geograph guys yet. When we're going to upload new batches, the categorization for these new batches should be much better because the category tree is much more extensive now.

Thanks everyone for helping out! Multichill (talk) 11:11, 23 May 2010 (UTC)

I'd be curious to know how many {{Check categories-Geograph}} are still in place.

A few problems I've noticed with the categorisation, which you may want to think about before the next run:

  1. Category:Calstock (in Cornwall) and Category:Bere Ferrers (in Devon) are adjacent and the border between them is very complex. Almost all the files I've seen for that area have identified the location as Calstock, and the county (for subject cats) as Devon. It would be nice if it could get it right some of the time...
  2. The current re-categorisation is (inappropriately) adding city categories for files well outside city boundaries (Example - >5 km from the city boundary).
One thing that would definitely be worth exploiting is the Ordnance Survey OpenData; in particular the Boundary-Line product. When the geolocation of the file is right, correct interpretation of that data would identify the most precise administrative region (typically the parish in England - so village level), which in turn would correctly categorise it (most of the time). If it was only used to county level depth, it would guarantee the Gwynedd issue doesn't recur.--Nilfanion (talk) 11:45, 23 May 2010 (UTC)
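Using boundary data in this way boils down to a point-in-polygon test. The Python sketch below shows the standard ray-casting version of that test on made-up rectangular "regions"; reading the real Boundary-Line files, handling multi-part boundaries and nesting parishes inside counties are all left out, and the function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon (list of vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def region_for_point(x, y, regions):
    """Return the name of the first region whose boundary contains the point."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None
```

Because the test works on the boundary itself rather than on the nearest named place, a photo just on the Devon side of a complex border would be assigned to Devon regardless of which village centre happens to be closer.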
I've been chatting away at the HotCat pages in an attempt to speed up the cat checking process, by making HotCat more template friendly - with little success though. --ClemRutter (talk) 15:28, 23 May 2010 (UTC)
208911 files are tagged with {{Check categories-Geograph}}. I'm thinking about not adding it to new uploads, what do you guys think?
As for the incorrect categories: my bot is only as good as its sources. If the Ordnance Survey OpenData is freely available and of better quality, it's certainly worth looking into.
But it's also possible to increase the quality of automatic categorization with the current tools:
  • I have a list of static mappings from geo id's to Commons categories. The full list of id's for the United Kingdom can be found here. I now changed the mapping of 2647716 (Gwynedd) from Category:Gwynedd to Category:Wales to prevent the flooding of the Gwynedd category. I will reread this page for more problematic locations (I remember something with Ohio) and will add them to the static mappings. Of course more static mappings can be added to improve categorization, it's probably useful to have all counties of the UK and Ireland mapped.
  • I have a list of imageclass to category mappings. It's probably worth checking the list of unmapped topics and map some more, but not a lot to gain here.
Multichill (talk) 10:15, 30 May 2010 (UTC)
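A static mapping of this kind amounts to a lookup that overrides the default place name. A minimal Python sketch, using the two geoname IDs mentioned on this page; the function name and the fallback behaviour are assumptions:

```python
# Overrides from geonames ID to Commons category, as mentioned on this page.
STATIC_MAPPINGS = {
    2655804: "Bethesda, Gwynedd",  # instead of the ambiguous Category:Bethesda
    2647716: "Wales",              # Gwynedd, remapped to avoid flooding Category:Gwynedd
}


def category_for_place(geoname_id, geoname_name):
    """Use the static mapping if there is one, otherwise fall back to the place name."""
    return STATIC_MAPPINGS.get(geoname_id, geoname_name)
```

Adding one entry per county of the UK and Ireland, as suggested above, would simply extend the table.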

Improved category intersection

Hi guys, I updated the bot which works on the categories tagged with {{Intersect categories}}, see Template talk:Intersect categories#Subcategories!. This should really improve our ability to split out crowded categories. Multichill (talk) 12:11, 30 May 2010 (UTC)

Conwy

Hi Multichill. I've just come across something which needs sorting out pronto. Some time since I last visited it, about 200 files have been added by bot to Category:Conwy. Conwy is a town, not a county. Almost all of these [new] files belong in Category:Conwy County Borough or (preferably!) its subcategories. Please could you sort it out and try to make sure it doesn't happen again? As I write this I'm also now wondering if the same thing has happened in the case of other categories where the county is named after a town, e.g. Category:Caerphilly / Category:Caerphilly County Borough. I hope not, as I've spent a lot of time trying to get some order in the Wales cats, although I can't go everywhere and do everything, of course. Best wishes, Anatiomaros (talk) 21:05, 12 August 2010 (UTC)

Of the files I've checked just now, it seems that they are not new files but just "old" ones marked as needing categories checked. These included one which I'd sorted months ago but somehow must have forgotten to remove the note - so although it was in three correct subcats (village name, *fields in, *rivers of) the [wrong!] parent category was still added by the bot. I've also had a look at Category:Wrexham and see that it's the same story there - a couple of hundred files of countryside scenes that belong in Category:Wrexham County Borough. So is the same thing happening with Bridgend/Bridgend County Borough and Caerphilly/... ? I'm afraid to look :-) 21:42, 12 August 2010 (UTC)
Ok. I'm sorting some stuff out.
Multichill (talk) 13:36, 14 August 2010 (UTC)
Ok, did that. Now the next step is to rebuild the tree under Category:Caerphilly County Borough because the naming is incorrect and some categories are under Category:Caerphilly (town). Multichill (talk) 14:27, 14 August 2010 (UTC)
Thank you very much! That should solve the problem for future uploads as well. Anatiomaros (talk) 16:19, 14 August 2010 (UTC)

ughhhhhh!

Did I speak too soon? The following categories have appeared in Category:Geography of Wales:

Category:Geography of Bridgend County Boroughhh
Category:Geography of Caerphilly County Boroughh
Category:Geography of Conwy County Boroughhh
Category:Geography of Wrexham County Boroughhh

They have subcats. I haven't had the time to look as I've just "popped in" now, late in the day, but maybe there are more? Simple and hopefully isolated error by MGA73bot2 but they obviously need deleting. Anatiomaros (talk) 23:58, 16 August 2010 (UTC)

Huhhhhhh? I guess that's a typo. Will nuke them when I've found the correct categories. Multichill (talk) 06:11, 17 August 2010 (UTC)
Oooops! --MGA73 (talk) 06:39, 17 August 2010 (UTC)

Greedy UK

On another note, I was checking out Category:Wells in the United Kingdom today when I noticed that wells in the Republic of Ireland had been included in that category (and Category:Wells in Ireland completely ignored). I don't think that will make a very good impression on the Irish, especially if this has happened with other categories... Anatiomaros (talk) 21:09, 12 August 2010 (UTC)

Wells is problematic. City, water wells, a bit of a mess. The whole Category:Wells by country tree should probably be renamed to Category:Water wells by country. Multichill (talk) 12:54, 14 August 2010 (UTC)
Ok, did that for the relevant categories now. Multichill (talk) 14:59, 14 August 2010 (UTC)
I think his issue was more that Ireland is not part of the UK (well, apart from the northern bit). -mattbuck (Talk) 15:31, 14 August 2010 (UTC)
That was indeed the point. Perhaps I could have made it clearer. This touches on a much wider point regarding our having Category:Ireland for the republic and Category:Ireland (island) for ... Ireland(!). I've brought this up on Category talk:Ireland but should probably get around to opening a discussion elsewhere. That whole category tree is very confusing and misleading and contains a number of glaring anomalies (IMHO of course). Some input would be good.
By the way, whilst I've no objection to having *Water wells... for *Wells, this does mean that the UK cats are now at variance with all the other *Wells cats. Anatiomaros (talk) 16:17, 14 August 2010 (UTC)
I know, the other categories still need to be renamed. Multichill (talk) 17:04, 14 August 2010 (UTC)

The big one

Earlier this week I got a hard disk containing all the 1.8 million Geograph images + a recent dump of the Geograph database. I'm transferring the files to the toolserver now, I updated my local copy of the Geograph database and changed the bot a bit to reflect some changes. I'm about ready to generate new batches to be imported. Multichill (talk) 17:09, 14 August 2010 (UTC)

As the old military expression goes: "Incoming!". How many batches are you planning and will there be intervals between them? Good luck and tnx again for all your hard work on this project - these images have really transformed our coverage of Wales and Britain at Welsh Wikipedia and I'm sure the same is or will be true with the other Wikipedia editions. Anatiomaros (talk) 18:14, 14 August 2010 (UTC)
I'm starting at image 250,000 and will be working in batches of 10,000 files. These batches are compiled and tarred up so one of the shell users can download them for import. I'll probably just keep one program running depending on the speed. I hope to be doing around one batch a day, but I'm not really sure.
For locations I'm combining the results of the two sources (Geonames and OSM). These locations + topics + a lot of filtering give the final categorization results. A lot of categorization issues have been solved so at least we shouldn't be running into these. If new categorization issues arise we'll just have to tackle them.
Commons:Batch uploading/Geograph/cats to clean up and Commons:Batch uploading/Geograph/top categories will be updated every once in a while to hunt down overflowing categories. Multichill (talk) 18:30, 14 August 2010 (UTC)
One problem with the top category list is the number of "X in county" categories high on that list. For example, Category:Fields in Devon does not have meaningful subcategories (all there is is one geographic subset), so with the present structure it is bound to be incredibly bloated (and needs more precise subject categorisation).
It would be helpful if the lists were split into those that need more precise location categories (such as "X in England") from those that need more precise subject categorisation (such as Category:Cumbria or "X in County").--Nilfanion (talk) 18:56, 14 August 2010 (UTC)
Any news on when "The big one" is happening? 86.169.41.141 10:49, 18 October 2010 (UTC)
Commons:Wiki Loves Monuments took a lot of time. I'm currently generating new batches. These will be slowly uploaded in the next couple of weeks. Multichill (talk) 11:00, 18 October 2010 (UTC)
I'm keeping track at Commons:Batch uploading/Geograph/Progress. Multichill (talk) 17:30, 21 October 2010 (UTC)

New upload problems: Slough

Hi, have just stumbled across some newly uploaded images from the BOT and have come to the fourth one that has had Category:Slough added to images on the East Riding of Yorkshire/Lincolnshire boundary. The problem is that Slough is in Berkshire which is miles away. For example take a look at File:Adlingfleet - geograph.org.uk - 251390.jpg. Keith D (talk) 19:32, 27 August 2010 (UTC)

I can't reproduce this. My assumption is that one of the two sources incorrectly returned Slough for some time. Multichill (talk) 13:46, 28 August 2010 (UTC)
I'm removing Category:Slough from all files in Category:Images from Geograph needing category review as of 25 August 2010. Multichill (talk) 13:51, 28 August 2010 (UTC)

Same thing has happened a second time. Category:Slough currently contains tens of misplaced images. Looking at a few at random, all seem to have been uploaded on 14 Dec 2010. Sorting out manually is possible, but would take a long time. -- PeterJewell (talk) 17:13, 10 January 2011 (UTC)

More upload problems

Hi, I have noticed a batch of images of central London which have had Category:Hertfordshire added to them, possibly related to the issue above? Here are the images I have found (I have recategorised the last two):

  • File:Houses of Parliament and the Thames - geograph.org.uk - 251103.jpg
  • File:Houses of Parliament from the River Thames - geograph.org.uk - 252435.jpg
  • File:Oxford Circus - geograph.org.uk - 254485.jpg
  • File:Cable Street - geograph.org.uk - 253565.jpg
  • File:Buckingham Palace and Victoria Memorial - geograph.org.uk - 251093.jpg
  • File:A view from the Hub - geograph.org.uk - 253933.jpg

Thanks for uploading so many images. 88.109.13.229 19:45, 30 August 2010 (UTC)

OSM thinks that City of Westminster is in Hertfordshire. Multichill (talk) 19:55, 30 August 2010 (UTC)
Then they are insane (and we should ignore that stupidity) :)--Nilfanion (talk)

Fordon

Hi, can you relocate the Geograph images in Category:Fordon to the correct category at Category:Fordon, East Riding of Yorkshire. Thanks. Keith D (talk) 17:25, 1 November 2010 (UTC)

✓ Done, more fun at User:Multichill/Zandbak. Multichill (talk) 19:00, 1 November 2010 (UTC)
Thanks. Keith D (talk) 21:02, 1 November 2010 (UTC)

Problems ahead

I have just made a comment on this deletion as the outcome of the discussion there could have serious implications for the import of Geograph images on to Commons. Keith D (talk) 18:02, 5 November 2010 (UTC)

Why exactly? Multichill (talk) 22:05, 5 November 2010 (UTC)
De minimis and FOP, etc. There was a comment that similar files from this and other projects might have to be deleted. Personally I think it's an over the top reaction but as you'll see from the DR others disagree. Anatiomaros (talk) 22:15, 5 November 2010 (UTC)
If they delete the image in question then a very similar image of the same plaque from Geograph, which would be loaded under the bulk loading, would need to be deleted under a similar rationale. If this is the case then each of the Geograph images would need to be examined to see if it is acceptable or not. Personally I think they are going too far but we shall see. Keith D (talk) 00:52, 6 November 2010 (UTC)
We're talking about over 2 million images. Of course some of these images will be deleted for various reasons. All part of the game. I'm not going to care about that. The copyright paranoid people are welcome to examine all the images after I've uploaded them. Multichill (talk) 11:20, 6 November 2010 (UTC)
Yes, it's no big deal as a proportion of our 2 million files. Inspecting them all could keep the "copyright paranoid" busy for a year or two - if they'd like to add a few categories whilst they're at it that would be great. :-) Anatiomaros (talk) 00:58, 7 November 2010 (UTC)

Categorisation

I'm inclined to say we need a page specifically to report bad categorisation on upload: This page is getting too long and ought to be about more general concerns. Another example: Images of St Thomas are getting placed into Category:St. Thomas (in the Caribbean!)--Nilfanion (talk) 12:25, 13 December 2010 (UTC)

We could create Commons:Categories needing disambiguation? At User:Multichill/Zandbak I have a nice list to start with. Multichill (talk) 14:32, 13 December 2010 (UTC)
That's a starting point yes. However, the proposed system would not capture everything because some categories are correctly located and still get irrelevant Geograph images dumped in them. A couple examples:
In both these cases, the bots really should not be applying the completely inappropriate location cats.
My initial suggestion would be to add some sort of sanity check: Simplest would be don't add a location category if it is not a subcategory of Category:United Kingdom or Category:Ireland. I realise that check would not work as it stands (Category:United States and Category:Australia are both subcats of the UK), but the idea is viable to my mind - and doesn't need us to write a huge list of exceptions.--Nilfanion (talk) 22:31, 13 December 2010 (UTC)
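For illustration only, the sanity check suggested above could be sketched in Python along these lines. This is not how GeographBot works: it assumes the category graph is available as a plain dictionary of parent links, and the function name `is_under`, the `blocked` set and the toy graph are all made up for the example. The sketch also shows why the naive check fails and how pruning a short list of country categories would avoid it.

```python
from collections import deque

def is_under(category, roots, parents, blocked=frozenset()):
    """Return True if `category` reaches one of `roots` by following
    parent links without passing through a category in `blocked`."""
    seen = set()
    queue = deque([category])
    while queue:
        current = queue.popleft()
        if current in seen or current in blocked:
            continue  # already visited, or a branch we refuse to follow
        seen.add(current)
        if current in roots:
            return True
        queue.extend(parents.get(current, ()))
    return False

# Hypothetical toy graph: each category maps to its parent categories.
PARENTS = {
    "Category:Plymouth": ["Category:Devon"],
    "Category:Devon": ["Category:England"],
    "Category:England": ["Category:United Kingdom"],
    "Category:Boston": ["Category:Massachusetts"],
    "Category:Massachusetts": ["Category:United States"],
    # The kind of link that defeats the naive check:
    "Category:United States": ["Category:United Kingdom"],
}
ROOTS = {"Category:United Kingdom", "Category:Ireland"}
FOREIGN = {"Category:United States", "Category:Australia"}

naive = is_under("Category:Boston", ROOTS, PARENTS)            # wrongly accepted
pruned = is_under("Category:Boston", ROOTS, PARENTS, FOREIGN)  # rejected
local = is_under("Category:Plymouth", ROOTS, PARENTS, FOREIGN) # accepted
```

The `blocked` set stays small (one entry per foreign country category), which is much shorter than a list of every ambiguous place name.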
Copied from User talk:Multichill
OK, I am getting a little irritated by GeographBot's stupidity on upload now. It will always put some images in the wrong locations as it will get the location wrong sometimes (especially as categories care about subject location not camera location) - that's not a problem. Putting rural snaps into the category of the nearest city (for example File:Valley below Membland - geograph.org.uk - 295892.jpg) is unfortunate, as is the bot failing to identify anything and just putting it in the county category. But in both cases, the categorisation isn't completely incorrect and whilst it should be avoided, the category it gets placed in is likely to be maintained by someone who can correctly categorise the file.
However, some of yesterday's batch have ended up in wildly incorrect categories. Geograph images should never be in Category:Melbourne or Category:Boston (but they are [4] and [5]). Not only are those categories severely wrong, their maintainers may not know what to do with a file that has no direct interest to them and they may cause problems for users. The first image I mentioned could be correctly used as a "Rural view near Plymouth", but if someone used this as a "Rural view near Boston, MA"...
(As you know) I've already mentioned this on Commons:Batch uploading/Geograph. However, if GeographBot continues to upload files and place them in categories for different countries, I am inclined to block it as malfunctioning.--Nilfanion (talk) 11:48, 15 December 2010 (UTC)
End of copy
I already do a lot of sanity checking to keep crap out. Undisambiguated categories are really problematic because it's very hard to make a distinction between the different possibilities. Fortunately the number of categories not disambiguated on purpose is very small, so we could just create a list of them and I'll just add it as (another) blacklist. Do you happen to have an idea how high the error rate is for this specific error? Multichill (talk) 16:42, 18 December 2010 (UTC)
  • Comment Bad categories are annoying, but if we blocked all users who do not add the right categories we would have to block a lot of users. Just have a look at Category:Media needing categories.
Unlike many of the users who contributed to the category mentioned above, Multichill and other users do A LOT to try to find the right categories for Geograph files, so I see no reason why the bot should be blocked.
The bot does make some mistakes, but once the files have been uploaded it is possible to correct many of the errors with a bot. All we need is someone to report "I found some files in category xx and it is not even in Europe." and then all we need to do is to find the right category and move all the files from the bad category to the right category.
Sometimes only local users can tell if a photo should be in category x or category y and, as said above, it is most likely that someone who can correctly categorise the file will do so.
Another alternative if we are not willing to accept some mistakes is to make the bot upload them in some "top category" and let users do all the categorization by hand.
So I suggest that we let the bot upload the files and every time a problem is found it is reported here so the bot can be fixed or a second bot can hunt and correct the errors. --MGA73 (talk) 17:03, 18 December 2010 (UTC)
  • The block bar for a bot is a lot lower, and this is part of why no one bot should do many different tasks: if it malfunctions in one role a block may be necessary but would stop it doing all the other things it does. Incorrect categorisation is different to insufficient or imprecise categorisation. A human who persistently adds a blatantly incorrect category to imagery would be liable to a block (for disruption); that's different from a human who is lazy and doesn't categorise at all.--Nilfanion (talk) 23:02, 18 December 2010 (UTC)
  • Comment I have a different take. Some place names in the English-speaking world are derived from a well-known UK place name. Melbourne, Derbyshire; New York, Lincolnshire; Washington, Co Durham are places that have been so honoured. Can someone write a simple bot that any user can launch on the root of a category tree (for example, Category:Nova Scotia) that recursively steps through that tree, looking for filenames that contain the text geograph.org.uk? Finding one, it will edit the text, replacing the current category name with the text Category:Misplaced geograph image. (For instance, the bot in the example above will find Category:Halifax Regional Municipality and remove at least 13 rogue files.) If recursive scanning is too dangerous, then perhaps a manual confirm should be made before anything is written. --ClemRutter (talk) 19:52, 18 December 2010 (UTC)
  • My point really is how hard would it be to blacklist Category:United States and all subcategories (and Australia etc)? Identifying the correct category is more complex and a bot will always make errors. But if a category is a category related to the USA, it's definitely wrong for a Geograph image and shouldn't be added. The bots should upload to the most precise category possible, and then humans should fix it. We shouldn't use incorrect categorisation as an intermediate to aid maintenance. If that means high-level categories can/will get bloated that's unfortunate, but it is never incorrect (and future bot runs can go over the high-level categories to add a more precise location).
  • And speaking personally as a major maintainer of Category:Devon: I'd much prefer Geograph images of Devon that cannot be correctly located by bot being dumped in the county category instead of being scattered into incorrect location categories: I can fix them if they are there, but if they are in a random category I'll never find them.--Nilfanion (talk) 22:52, 18 December 2010 (UTC)
    • Blacklisting a category is easy, blacklisting a category and its subcategories is hard. I still have a lot of batches in queue. I'm not going to recompile them, but I am going to apply some logic to see if it's possible to hunt down the images not under Category:United Kingdom. Multichill (talk) 21:18, 22 December 2010 (UTC)
      • I agree blacklisting a whole bunch of categories is tricky. Two complications to consider: Firstly, both Category:United States and Category:Australia (amongst others) are subcategories of Category:United Kingdom, so if an image is mis-sorted into a US category, it will still be in a subcategory of the UK cat. And don't forget about Ireland :)--Nilfanion (talk) 22:23, 22 December 2010 (UTC)
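As a rough sketch of the tree scan proposed in this thread (start at a root such as Category:Nova Scotia, step through the subcategories, and look for filenames containing geograph.org.uk), something like the following Python could serve as a starting point. It is a dry run only: it works on in-memory dictionaries rather than the live wiki, it lists candidates instead of editing anything (so a human can confirm first), and the function name, data layout and file names are all hypothetical.

```python
def find_geograph_files(root, subcats, files, marker="geograph.org.uk"):
    """Walk the category tree under `root` and return sorted
    (category, filename) pairs whose filename contains `marker`."""
    hits = []
    seen = set()          # guards against category loops
    stack = [root]
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for name in files.get(cat, ()):
            if marker in name:
                hits.append((cat, name))
        stack.extend(subcats.get(cat, ()))
    return sorted(hits)

# Hypothetical data: subcategory links (including a loop) and file lists.
SUBCATS = {
    "Category:Nova Scotia": ["Category:Halifax Regional Municipality"],
    "Category:Halifax Regional Municipality": ["Category:Nova Scotia"],
}
FILES = {
    "Category:Nova Scotia": ["File:Peggys Cove lighthouse.jpg"],
    "Category:Halifax Regional Municipality": [
        "File:Example street - geograph.org.uk - 000001.jpg",
        "File:Halifax harbour.jpg",
    ],
}

candidates = find_geograph_files("Category:Nova Scotia", SUBCATS, FILES)
```

The `seen` set matters because Commons category trees can contain loops; without it a recursive scan may never terminate.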

I've now blocked the bot, see Commons:Administrators' noticeboard/Blocks and protections#GeographBot. I'd be happy for the block to be removed if there is a concrete fix in place to sort mis-categorised files (or prevent it happening in the first place (please?)).--Nilfanion (talk) 23:04, 22 December 2010 (UTC)

General feedback

Sorting Geograph images is a 3-stage process really:

  1. Initial categorisation at/before upload
  2. Recategorisation by bot
  3. "Final" tweaking by human.

I have been doing extensive work on that "final" stage and can draw some conclusions from what I've seen:

  • In general, the subject categorisation (houses/fields/trees in X) is fine, though additional subject categories may need adding.
  • The location categories for images of cities, towns and villages are usually fine - though city cats are inappropriately added to the surrounding rural areas.
  • However, the error rate for categories of rural imagery is much higher.
  • The issue raised above with mis-categorised files is not that common in terms of raw numbers - the problem is the severity (if it is not localised to the correct neighbourhood it's much harder for users with "local" knowledge to sort).

Files can have an incorrect location cat in three ways: Non-location categories being treated as a location such as Category:Treen, the Melbourne situation and clearly wrong locations within the UK (Two examples: 1 - the village is several km from the photo and there are other villages between them and Category:Corntown - a Welsh village with English content in the cat).

Incidentally when the file is miscategorised, I think the bot is already picking it up. In general, even if the location identified is "wrong" it still adds the correct x in county image - the example I just gave correctly placed it in Category:Moorlands in Devon, even though it applied a Somerset village category. If it merely identified this conflict and just added the Moorlands in Devon cat it would be correctly (if inadequately) categorised.
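To make the conflict check described above concrete, here is a minimal Python sketch. It assumes the bot has (a) the "X in County" subject categories it is about to apply and (b) some table saying which county each location category belongs to; both the table and the function name are invented for the example, and "Category:Example village" is a placeholder, not a real category.

```python
def resolve_location(location, subject_cats, county_of, known_counties):
    """Return the categories to apply, dropping `location` when its county
    conflicts with the county named by the subject categories."""
    # Collect counties named by "X in County" subject categories.
    named = set()
    for cat in subject_cats:
        if " in " in cat:
            county = cat.rsplit(" in ", 1)[1]
            if county in known_counties:
                named.add(county)
    location_county = county_of.get(location)
    # Conflict: a county is named, and the location is not known to be in it.
    if named and location_county not in named:
        return list(subject_cats)
    return list(subject_cats) + [location]

KNOWN_COUNTIES = {"Devon", "Somerset"}
COUNTY_OF = {  # hypothetical lookup table
    "Category:Example village": "Somerset",
    "Category:Modbury": "Devon",
}
subject = ["Category:Moorlands in Devon"]

conflicting = resolve_location("Category:Example village", subject,
                               COUNTY_OF, KNOWN_COUNTIES)
consistent = resolve_location("Category:Modbury", subject,
                              COUNTY_OF, KNOWN_COUNTIES)
```

In the conflicting case the file keeps only the county-level subject category, which is imprecise but not wrong; a location with no entry in the table is treated the same way, on the cautious assumption that an unknown county counts as a conflict.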

The error rate for rural imagery is significant, and is a natural result of the "nearest village" algorithm. In my last 100 edits in file space (almost all Geograph checks) ~20% corrected the location. If you bear in mind this is really a problem for rural stuff - if this sample of files is representative then the error rate on rural images may well be closer to 50%. These figures are high enough that I cannot trust the bot categorisation and have to verify manually.

Oh and what I mean by error in this context: The bot identified location is not in the correct civil parish. For example, File:Haws near Butland Wood - geograph.org.uk - 274186.jpg is a view of Modbury parish (as can be verified from OS mapping) but was categorised on upload as being in the adjacent location of Kingston. It's correct to say the image is of Modbury, it is not correct to say it's of Kingston.--Nilfanion (talk) 10:42, 23 December 2010 (UTC)

Cleveland

I'm getting tired of cleaning up after the ill-behaved GeographBot, and it's making me very cranky. Because the bot is unable to determine that Cleveland, Ohio is on another continent, it's dumped hundreds of images into Category:Cleveland, Ohio over the past few weeks. I've dutifully removed the category each time. I eventually grew sick enough of mopping up that I created Category:Cleveland, England, as suggested above, despite the fact that nobody in England had bothered to do so. Now that the category exists, the bot is ignoring it entirely, and dropped another batch of around 130 images into Category:Cleveland, Ohio. This is getting ridiculous. - Eureka Lott 00:36, 31 December 2010 (UTC)

That's because someone redirected Category:Cleveland to Category:Cleveland, Ohio. This will move the images. Multichill (talk) 06:19, 31 December 2010 (UTC)
In this case, GeographBot probably should not be adding Cleveland at all, as it's a former county that does not exist today. That category should be populated by the relevant sub-categories. File:Borough Beck, Helmsley - geograph.org.uk - 331844.jpg looks typical - Cleveland doesn't get mentioned in either of the geonames or OSM datasets, and the bot is placing them correctly in the appropriate N Yorkshire village and subject categories: In this case just don't add the category at all.--Nilfanion (talk) 10:39, 31 December 2010 (UTC)
Oh and this problem was reported in July. This shouldn't be a problem with the current uploads (and Category:Cleveland should not be a disambiguation page).--Nilfanion (talk) 10:44, 31 December 2010 (UTC)
The disambiguation page was created today. Is there somewhere we should discuss this? - Eureka Lott 17:25, 31 December 2010 (UTC)
@ Eureka Lott, I understand your frustration. If you find other problems like this just leave a note here. Bots can clean up.
@ Nilfanion, this sounds strange. Until a permanent solution is found I think the best is to add the images to Category:Cleveland, England. Then users can either move them manually or a bot can try to find a better place. --MGA73 (talk) 11:09, 31 December 2010 (UTC)
The bot is finding the better place already - look at the images in Category:Cleveland, England - they all have more precise town/village level categories (but with accuracy issues, the 20% error rate mentioned in previous section), so in this case removing Cleveland altogether is OK. It's probably worth humans adding the appropriate towns to the Cleveland cat. Not a clue why the bots are using Cleveland when it's not mentioned in the tools...--Nilfanion (talk) 11:20, 31 December 2010 (UTC)
I did report the problem here in April, but was brushed off by Multichill, as if it was a problem with the redirect and not his bot. - Eureka Lott 17:25, 31 December 2010 (UTC)
This upload is a big project and there are thousands of things to fix so I'm not surprised if a few things are not fixed the first time they are reported. Well in the future we can perhaps all do better. Also this is a Wiki so if a redirect causes problems just fix it :-) --MGA73 (talk) 13:09, 1 January 2011 (UTC)
Both of Multichill's proposed category changes were implemented (the creation of Category:Cleveland, England and the conversion of Category:Cleveland to a disambiguation page), yet the bot is still adding files to Category:Cleveland, Ohio. Is there any point to reporting problems here? - Eureka Lott 20:19, 2 January 2011 (UTC)
Hm... We can always try this [6] to remove the files in the wrong category while Multichill is trying to figure out what the problem is. --MGA73 (talk) 20:55, 2 January 2011 (UTC)
I asked Multichill yesterday and he said the problem is this: Geograph upload is done in a number of uploads. Each upload "package" has to be compiled before upload. Some of the packages were prepared in August (before the problem was fixed). There are still a few of these "old" packages left. So one option is to delete the packages that have already been prepared and compile them again (it takes a lot of time). Another option is to upload the images and fix the problems when the packages are uploaded (takes much less time). So that is why the bot still uses the old category.
As you can see from the "trick" I did above it only takes a few minutes to move all the images from the wrong category to the right one once the images are uploaded. So we should just make a note of categories with problems and fix for new uploads and get a bot to clean up the old uploads. --MGA73 (talk) 18:28, 3 January 2011 (UTC)
Same concern with category:Moscow. I cannot imagine the reason why Scottish photos end up there but yes they do every week or so (example). ??? NVO (talk) 07:31, 3 January 2011 (UTC)
If you click on the link this OpenStreetMap tool in the category check box you can see that it says "<hamlet>Moscow</hamlet>" = en:Moscow, East Ayrshire / Category:Moscow, East Ayrshire. A bot can move the photos to the right category. --MGA73 (talk) 18:11, 3 January 2011 (UTC)
Yes, it can. No, please don't. NVO (talk) 22:11, 3 January 2011 (UTC)

Disambiguation problems

I want to attack the disambiguation problem. For that I need a list of problematic categories. Please add them to Commons:Batch uploading/Geograph/Disambiguation problems. I'll go through this page to see what categories were already mentioned, but I might miss some. There will be two approaches:

  1. Cleaning up after upload (for the already uploaded batches and the batches already compiled).
  2. Prevent new images from ending up in the wrong categories.

Multichill (talk) 11:11, 1 January 2011 (UTC)

This can only ever be a partial fix (but still helpful): For the simple reason that it is reactive not preventative, and it relies on users reporting the bad cats.
A less serious but more frequent problem is bad categories within the UK. For example, File:River Piall in Slade Park - geograph.org.uk - 275016.jpg and File:A48 - Brocastle - geograph.org.uk - 286704.jpg are close to two different Corntowns. In this case a straight disambiguation (to Corntown, Vale of Glamorgan and Corntown, Devon) is reasonable, but in some the major use is going to be overwhelmingly more important and so should not be disambiguated at all (Luton for instance). Some mechanism to report/handle these is also needed.
Incidentally the "when to disambiguate" question really should be handled by community-at-large, so I'll start discussion at VP later. I'm not convinced "disambiguate absolutely everything all the time" is optimal. Started discussion on this point at VP: COM:VP#Disambiguation of categories.--Nilfanion (talk) 23:04, 1 January 2011 (UTC)

Melbourne images

Can you please stop adding Geograph images to the Melbourne category which is in Australia! These images should be placed in Category:Melbourne, East Riding of Yorkshire. Can you move all of these across? Thanks. Keith D (talk) 18:52, 10 January 2011 (UTC)

Arras images

Can you please stop adding Geograph images to the Arras category which is in France! We have no settlement category for this and these images should be distributed into the appropriate categories Category:Etton, East Riding of Yorkshire, Category:Goodmanham or Category:North Newbald depending on the civil parish that image is in. I have relocated the images from the recent upload. Thanks. Keith D (talk) 18:52, 10 January 2011 (UTC)

Slough images (again)

Same thing has happened a second time. Category:Slough currently contains tens of misplaced images. Looking at a few at random, all seem to have been uploaded on 14 Dec 2010. Sorting out manually is possible, but would take a long time. Please resolve this soon. Thank you. -- PeterJewell (talk) 13:37, 12 January 2011 (UTC)

Preston

Category:Preston is taking a hard hit- gathering images from Brighton, Kent, Devon and LB Brent- each of these counties has a hamlet of Preston that may be doing the damage. I have knocked off a few in passing. --ClemRutter (talk) 00:10, 13 January 2011 (UTC)

There are some for the town in the East Riding of Yorkshire that I have pulled from the cat, may be more. Keith D (talk) 12:04, 18 January 2011 (UTC)

Process of checking categories with HotCat

Just a comment for the next time we process a million images. I know of no way of removing the {{Geograph- please check cats}} template while just using HotCat- so I am just doing the obvious changes but failing to remove the tag- which is not wrong but doesn't need to be there.--ClemRutter (talk) 00:10, 13 January 2011 (UTC)

Characters in file names

The Commons filenames are derived from the file name on Geograph - most of the time these are ok, but some may need tweaking because they aren't that useful. No way that can be sorted out until human review of course.

However, there are a few non standard characters in Geograph file names, for example File:King’s Nympton, towards Highridge - geograph.org.uk - 267515.jpg. The use of the non-ASCII apostrophe makes linking/using the file more awkward and the character may not render properly for all users. Could that character just get mapped to the standard apostrophe ' ? (Incidentally, some archiving might be nice)--Nilfanion (talk) 23:17, 17 January 2011 (UTC)
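The mapping asked for above is simple to sketch. The following Python is only an illustration (the function name `normalise_title` is made up, and nothing here is taken from the bot's actual code): it replaces the typographic single quotes U+2018 and U+2019 with the plain ASCII apostrophe and leaves every other character alone.

```python
# Typographic single quotes mapped to the plain ASCII apostrophe.
APOSTROPHE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'"})

def normalise_title(title):
    """Return `title` with curly apostrophes replaced by a plain one."""
    return title.translate(APOSTROPHE_TABLE)

original = "File:King\u2019s Nympton, towards Highridge - geograph.org.uk - 267515.jpg"
cleaned = normalise_title(original)
```

Other legitimate non-ASCII characters (accented letters in Welsh or Irish place names, for instance) pass through unchanged, which is probably what is wanted.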

Gay Street

I'm confused as to where the images in Category:Gay Street should belong to. It's nice to see stretches of farmland in downtown Manhattan but ... :)) NVO (talk) 22:36, 23 January 2011 (UTC)

Oh, that's a good one! I've sorted the files (moving them to Category:Pulborough (or its subcat Category:North Heath) or Category:West Chiltington as appropriate - as these are the relevant civil parishes). Not one of those images can be described as an image of Gay Street, West Sussex (so no need for Category:Gay Street, West Sussex).--Nilfanion (talk) 00:01, 24 January 2011 (UTC)
Is there any mileage in suggesting that Manhattan should apply to become a twin town to Pulborough? There was an image showing common roots. :-) --ClemRutter (talk) 10:08, 24 January 2011 (UTC)
Properly disambiguated by now. --Foroa (talk) 12:14, 24 January 2011 (UTC)

Indefinitely on hold

Over the last year I spent a lot of time on this project. In this project we got a lot of images, it was a fun thing to do and I got a lot of positive feedback. Over the last couple of months this changed. I got a lot of negative feedback from a small group of people and barely any support from the people who like this project. With this topic we hit rock bottom, this project is no fun for me anymore so I'm going to waste my time on something else. I hope you're all happy with that. Multichill (talk) 12:30, 24 January 2011 (UTC)

100% support here- the images I am waiting for are after 733,000- with the most modern uploaded yesterday. I try to keep my head down when it comes to squabbling with rogue admins- I generate material on Wiki, and upload geotagged photos to Commons and a short spat on the Admin pages saps you of time and the will to live. There is only one criterion- will it be used- not has some modern town named itself after a village that has been in existence for two millennia and then is surprised if they receive images of their namesake in their cat! So all I am asking is that you restart the bot- but run it a lot lot faster- in return there is coffee on the stove and a warm welcome should you pass by --ClemRutter (talk) 14:33, 24 January 2011 (UTC) Rochester, Kent.
I'm sorry to hear this. I'm no admin and don't really have the necessary debating skills to do much about this, but know this; your work is massively appreciated by many silent users. I'd like to be able to change your mind but am unable to do anything about a few noisy nitpickers Oxyman (talk) 16:22, 24 January 2011 (UTC)
  • Damn... --MGA73 (talk) 19:42, 24 January 2011 (UTC)
...and double damn. Please reconsider, Multichill. I can well understand how you must feel but this is quite possibly one of the most important projects we have ever seen on Commons. I have used many GeographBot images on Welsh Wikipedia and have even created articles I'd probably not have started had I not come across an image here whilst categorising these files. Your work is transforming the Britain and Ireland geo-cats and is greatly appreciated by many of us, here and on the various wikipedia editions. Some problems with categories are just inevitable in a project of this size - and I've dealt with my share of them - but responses like "block the bot!" are not warranted. Between that and the fact that some people seem to be here on a mission to have as many images as possible deleted on the slightest of technicalities, I sometimes wonder what is becoming of Commons these days. Anatiomaros (talk) 20:02, 24 January 2011 (UTC)

Suggestion to resume upload

I would really like to see the rest of the files uploaded. In my opinion a few wrong categories are acceptable but sadly not all users share this idea.

So I suggest we upload all the files but do not categorize them with a bot. Instead they are placed in a category like Category:Uncategorized files from Geograph. Once the files are uploaded users can work on the files manually or perhaps someone can design a bot that can categorize some of the files (once they are on Commons EVERYONE can work on them).

So if you support this idea please add your name below and hopefully we can get Multichill to upload the files.

  • Support --MGA73 (talk) 20:29, 25 January 2011 (UTC)
    I talked this over with MGA73 on irc. The main point of concern is categorization. So let's do it in two parts:
  1. Rapidly upload the remaining images, but don't categorize them, tag them with {{Uncategorized-Geograph}}
  2. Slowly work on the uncategorized Geograph files and improve the categorization algorithms even more
The code for the second part is public so anyone can run it or improve it. Multichill (talk) 20:38, 25 January 2011 (UTC)
+1 to this for what it's worth; I'd support a rapid, uncategorised upload if that means the exercise can be done more quickly (the categories are the bottleneck anyway - distributing that stage more widely seems like a plan).
Incidentally, there has been one major change in the background since the start of the uploading: The OS OpenData release. Furthermore, (I've just noticed) it has been converted into a reverse lookup system: This file gives this, which correctly identifies it as Modbury, unlike Geonames, which incorrectly goes for Kingston, whilst OSM picks up Lower Torr (a hamlet in Kingston parish). If a bot made use of this service, it would accurately identify the civil parish (in England) on 99.99% of occasions. As I've mentioned above, the civil parish is a "correct" location.
I can also see a method using that database, that (for England and Wales only unfortunately) will correctly categorise nearly all files and avoid all the bad location categories mentioned in all threads above. That is, create a matrix that links all civil parishes and communities to their MaPit ID (or the ONS one) and to their category on Commons. Then a bot could look up the coordinates on MaPit, extract the ID and get the correct category.--Nilfanion (talk) 21:15, 25 January 2011 (UTC)
  • Comment Could we use a minimal level of categorisation (as opposed to no categorisation), eg by making use of the grid square and tagging with {{Uncategorized-Geograph|TQ1234}}? That would break down the hundreds of thousands of files into a manageable amount; and minimal location info makes manual sorting possible.--Nilfanion (talk) 23:33, 25 January 2011 (UTC)
  • Support Oxyman (talk) 03:21, 26 January 2011 (UTC)
  • Support -- this has been wonderful in grabbing so many useful images (and saving us a huge amount of time uploading them manually). Grabbing them into an 'uncategorized' category and processing them later seems like a very sensible idea if that gets them all here quicker. -- PeterJewell (talk) 21:46, 2 February 2011 (UTC)
    • The batch uploading has resumed. All files will be uploaded in batches of 10,000 files. Between each batch there is a 45 minute sleep period so as not to overload the servers. It will probably take about 10-15 days to upload the remaining files. The files will end up in categories like Category:Images from the Geograph British Isles project needing categories as of 3 February 2011 and a grid square category. Multichill (talk) 13:14, 3 February 2011 (UTC)
      • That is good news- if you want any help/support in future ping my talk page. --ClemRutter (talk) 19:36, 3 February 2011 (UTC)
        • Good news indeed, but it might need some attention to create/redirect all grid related categories, as in Special:WantedCategories. Is there no way a bot could create them so that only the faulty ones need correction? --Foroa (talk) 07:52, 4 February 2011 (UTC)
      • Will it be possible to run a bot on the geograph pictures already uploaded, so that all are placed in grid square categories? That would make checking for correct locations so much easier. -- PeterJewell (talk) 16:01, 4 February 2011 (UTC)
  • I've created Category:Images from the Geograph British Isles project needing categories by grid square so we can put the grid square cats in one place, it may be logical to subdivide that further. I think it might be an interesting "game" to clear the square cats :)
  • Oh and a question - what is the final file in the batch? Geograph is an active project and there will have been many more pictures added to its database since the start of this (which won't be in the batch).--Nilfanion (talk) 21:48, 4 February 2011 (UTC)
  • Comment Not knowing any better, a few hours ago I started to create a bunch of the missing "by grid square" categories (example - note the sortkey), placing them in Category:Images from the Geograph British Isles project needing categories by grid square. Now that I see how many new "wanted" subcats are still being (not) created, I'm stopping my "manual" work on this. As discussed above, we definitely need a bot to go through Special:WantedCategories, identify the relevant missing cats and create them. You can see how I was starting to do it. If anyone has a better idea about how to categorize the new subcats (maybe also linking back to the grid pages at geograph.org.uk, which I wasn't bothering with), please suggest it. - dcljr (talk) 09:09, 7 February 2011 (UTC)
  • One way for bots to do it is to just create categories for all grid squares in the right 500km2. After the upload is completed, go through and delete the empty squares. With additional thought, we can reduce the amount of pointless creations/deletions significantly: For example, all photographed squares in SX are north of SX3833, so no need to create any of SX**00 through to SX**32 (saving 3,300).
  • The grid squares are shown at [7] - 75 in all. Some of those have a very low number of squares with images: I've created all necessary cats for MC, OV, HW and HX manually. Some will need 10,000, 13 of them (NN, SE, SJ, SK, SO, SP, TL, SU, H, M, N, R and S) are nearly 100% land.
  • Alternative approach, could be to look at geograph's metadata and figure out which squares are needed that way? (I'm not sure on feasibility there).--Nilfanion (talk) 11:30, 7 February 2011 (UTC)
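
The "create everything in range, then prune" idea above could be sketched roughly as follows. This is only an illustration, not GeographBot's code: the function name and the `min_northing` parameter are made up here, and the category prefix is copied from the temporary category names quoted in this thread. Note that rows 00 through 32 inclusive are 33 rows of 100 squares each, i.e. 3,300 squares skipped.

```python
# Prefix taken from the temporary category names mentioned in this discussion.
CATEGORY_PREFIX = (
    "Images from the Geograph British Isles project "
    "needing categories in grid "
)


def grid_square_categories(letters, min_northing=0, max_northing=99):
    """Yield category titles for the 1 km squares of one 100 km square.

    letters: grid letters such as "SX" (or a single letter for the Irish grid).
    min_northing / max_northing: inclusive range of the two northing digits,
    used to skip rows known to contain no photographed squares.
    """
    for easting in range(100):
        for northing in range(min_northing, max_northing + 1):
            yield f"Category:{CATEGORY_PREFIX}{letters}{easting:02d}{northing:02d}"


# SX example from the thread: nothing photographed south of row 33.
sx_categories = list(grid_square_categories("SX", min_northing=33))
skipped = 100 * 100 - len(sx_categories)
```

A bot would still need to check which of these titles already exist before creating anything, and the empty ones would be deleted after the upload as suggested above.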

Mangled image

I just wanted to let you know I have re-uploaded File:Rolls_Royce_Cars,_Hellens,_Much_Marcle_-_geograph.org.uk_-_583392.jpg from original source - the bot uploaded some binary data of the same length, but that was not the image. You might want to investigate this, if you have time.  « Saper // @talk »  01:57, 14 February 2011 (UTC)

Adding categories- going on a wikiwalk

During the last week I have been spending many pleasant and nostalgic hours walking in the Peak District-- all from the comfort of my computer. I am now at the stage of taking off my boots, having a pint and discussing the many things I have seen.

Geograph photos are brilliant... but back to categorisation and a few rambling thoughts! Firstly in lowland Britain many towns have no displaced photos but cathedral cities seem to include photos of every church in the diocese, maybe 50 or so from villages 40 miles away. Would it be acceptable to create a category Category:Churches in the diocese of Rochester then Cat-a-lot the whole batch there?

When using Cat-a-lot you cannot remove the geograph-template. Do we need to create Category:Geograph photos that have been examined but are awaiting a bot to remove the tag? It is time consuming to process them individually with Hot-Cat in (++) mode.

Secondly, up in the Peak District- I moved to Category:Glossop. I have previously worked and photographed here. I have spent time drilling down the sub cats so that the town had a sensible number of settlement related photographs. I forget the number of geophotos I have found here- maybe I have moved about 120. It is understandable as the town of Glossop is the last sizeable settlement below the hills in the Peak District National Park. One shot was 14km away and in all, it seems to cover about 150 sq km of moorland. OK finding the location of a shot called Peat bog.jpg or Bridge over troubled water was fun but... Well I have categorised most of them using existing Cats, such as Category:Bleaklow, Category:River Kinder and created a few more such as Category:Shelf Moor, Category:William Clough. It is here that I am starting to have questions about the correct way to cat moorland, I guarantee I will have got it wrong, and then what location category do you use?

Category:Moorlands above Glossop would be an easy way, though not quite accurate.
Category:Tintwistle CP uses the notation from the OS map, CP being civil parish- this could separate village shots from shots of the area that is nominally administered by a parish council that has taken its name from that village. (I can't find a Glossop CP- as urban areas are not parished and a parish council may choose a name other than that of the village where they meet; this is not a surprise, but is not BOT friendly). Should we do this? To an outsider CP could be seen as anti-wiki. Do we create artificial CP categories?
Category:Bleaklow, Category:Shelf Moor- does one put both, as Shelf Moor could be seen as a small part of Bleaklow? Is it better to classify the moor, or the brook, beck, bach, Clough draining the moor, such as Category:Shelf Brook? I have gone for both. I put the brook as a subcat of the river it feeds into, and Category:Streams in Derbyshire. Does one go further and create Category:Streams draining Bleaklow? Do we whinge about the cat Category:Moorlands in Derbyshire when the plural of Moorland is Moorland, and as a natural feature it should be Category:Moorland of Derbyshire?

Now we come to the other images on Google maps- the ones transferred more recently. Many advantages but a pain to categorise.

There is nothing there to work with as none of the Geograph cats have been transferred. Each image has to be processed manually. We have all the angst about the location of the square (Category:Tintwistle CP, Category:Crowden Great Brook) as above, and then no clue what the image is: Footpath, Wooden bridge, Wheatear, Moorlands in Derbyshire, Peat, Heather. Is there sufficient geograph metadata to run a bot- to add all the categories other than location, which seems to have been the biggest sticking point in the past?

So here are a few thoughts I'd like to share, before I put my boots back on again and 'toddle yam o'er them moors' --ClemRutter (talk) 11:25, 12 February 2011 (UTC)

Now that's a lot of stuff. To get through the various component questions:
  1. The problem with cities is not related to religion and would probably not align with the diocese boundaries anyway, so that cat-a-lot idea isn't workable IMO. It is a result of the coding for cities being "greedy" and including a lot of the surrounding countryside. Some cities are much worse than others; Plymouth was exceptionally bad before I fixed it: all these were categorised to Plymouth, but most are not of the city.
  2. On the moorland / CP issue: Personally I like everything to be placed to the correct parish (or better), so all hamlet cats are included in their parish categories. This makes the CP a useful way to search and exploit via category intersects. This does mean that in some cases (eg Category:Peter Tavy), countryside stuff overwhelms the few snaps of the village proper. This could be handled by splitting the parish from the village, if this is done Category:Peter Tavy (parish) is better than Category:Peter Tavy CP ("Peter Tavy CP" doesn't exist but the CP called "Peter Tavy" does, but needs then disambiguating from the village), and this is not an artificial construct as it represents a real area. The unparished area of Glossop is an equally well defined area - just don't call it Glossop (parish) because it isn't :)
  3. As for the subject categorisation: Do what seems most appropriate. Create categories if there's an article on WP on the subject, which would benefit from it. Don't bother creating intersection categories unless/until the parents are too bloated, in which case do so. Don't lose sleep over it in any case, otherwise you'll go round in circles for months. In the case of Derbyshire, "Streams in the Peak District" is a natural sub-cat. Think about it in urban terms: At what point would you create Category:Darnley Road, Rochester?
  4. The recent transfers do include the grid square, which is better than nothing. The grid square readily gives the civil parish (or whatever), so potentially provides greater accuracy for the localisation - You could use Cat-a-Lot to move everything in Images from the Geograph British Isles project needing categories in grid SE2933 to Category:Leeds, which is where GeographBot would probably have put them. They would still need subject categories, so shouldn't have the "give me categories" tag removed. The bots were always pretty good with subject categorising, they (nearly) always get the <subject> in <county> category right. Extracting the subject tag from Geograph would help there, and could always be done by bot either on upload or later. Remember, the subject categorisation always benefits from a manual look, as Geograph only records one subject matter, and images may relate to multiple subjects - no bot can add the info when it just isn't there.

Think that covers it :)--Nilfanion (talk) 19:07, 12 February 2011 (UTC)

I think there is enough there to formulate an advice note.
New problem: I found a source square. Category:Images from the Geograph British Isles project needing categories in grid SJ9504- I found the target cat Category:Essington- I fired up Cat-a-lot- Selected all, and clicked on move, wheels whirred - and it processed all 13 files then spat up -
Done.
All pages are processed
Return to Page
The following files were skipped because the old category could not be found.
Then the names of 13 files
Please say it's just me! --ClemRutter (talk) 19:27, 13 February 2011 (UTC)
It hasn't gone away! --ClemRutter (talk) 01:10, 21 February 2011 (UTC)
Unfortunately, that's a problem with Cat-a-Lot. The grid square category is actually not on any of the file pages, but it is included via the {{Uncategorized-Geograph}} template. You may want to bring it up at Mediawiki talk:Gadget-Cat-a-lot.js and see if the developers of the tool can help.--Nilfanion (talk) 12:27, 24 February 2011 (UTC)
I see you have been over there in the past. I have added a request for help, as i think that with 1.5 million files to process we have a good case. You are right, there is nothing resembling a category- either cat-a-lot must be liberalised, or we need a BOT to trawl through the database --# searching gridref=AA9999 writing to file Category:OS grid AA9999- but BOT writing exceeds my pay grade. We can hope that User:DieBuche comes up with a solution. --ClemRutter (talk) 18:04, 24 February 2011 (UTC)
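
The bot ClemRutter describes (find the grid reference, write an explicit category onto the file page) might look something like the sketch below. It works on page wikitext only and leaves out the actual reading and saving of pages. It assumes the grid reference appears as a parameter of {{Uncategorized-Geograph}}, as in the {{Uncategorized-Geograph|TQ1234}} form proposed earlier on this page; the real template syntax should be checked first. The "Category:OS grid ..." name is the one suggested in the comment above, and the function name is invented for this example.

```python
import re

# Assumption: the grid reference is a parameter of the template, e.g.
# {{Uncategorized-Geograph|SJ9504}}. Verify against the real template.
TEMPLATE = re.compile(r"\{\{\s*[Uu]ncategorized-Geograph\s*\|([^{}]*)\}\}")
GRIDREF = re.compile(r"\b([A-Z]{1,2}\d{4})\b")


def add_explicit_grid_category(wikitext):
    """Return wikitext with a hard-coded grid category appended, if one is found.

    The text is returned unchanged when the template or grid reference is
    missing, or when the category is already present.
    """
    template = TEMPLATE.search(wikitext)
    if template is None:
        return wikitext
    gridref = GRIDREF.search(template.group(1))
    if gridref is None:
        return wikitext
    category = f"[[Category:OS grid {gridref.group(1)}]]"
    if category in wikitext:
        return wikitext
    return wikitext.rstrip("\n") + "\n" + category + "\n"


page = "== Summary ==\n{{Uncategorized-Geograph|SJ9504}}\n"
updated = add_explicit_grid_category(page)
```

With the category written on the page itself rather than supplied by the template, Cat-a-lot should be able to find and move it.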

OK I did an auto-deletion request thing and didn't realise the deletion request would be here. The only reason I was suggesting the image should be deleted is because it appears to be exactly the same as File:PSndwnTB34.JPG, the uploader on Geograph obviously chose to upload it here to Commons as well and there's little point in both being on Commons. Editor5807speak 11:54, 23 February 2011 (UTC)

In future is it possible to copy & paste the geocoding from the Geograph image onto the image to be kept so that as much info as possible is retained? Oxyman (talk) 23:59, 23 February 2011 (UTC)
Fixed -- Common Good (talk) 19:52, 28 February 2011 (UTC)

Small Thumbnails

The bot seems to have uploaded a couple of small thumbnails rather than the actual images: File:Arrival at Aberystwyth - geograph.org.uk - 580828.jpg and File:Vale of Rheidol Railway - geograph.org.uk - 775530.jpg. Oxyman (talk) 18:31, 1 March 2011 (UTC)

Ah, the archive we sent to wikimedia containing the images may contain a small number of thumbnails like this. The filenames would end in ..._60XX60.jpg. Ideally the bot should exclude any filename containing "X" or "x" characters (a normal image filename will never contain those letters). BarryHunter (talk) 16:37, 5 March 2011 (UTC)
Also medium sized thumbnails? like File:Ottendorf Green.JPG. - Category:Images of the Geograph British Isles project requiring attention seems a good place to put these Oxyman (talk) 00:39, 19 March 2011 (UTC)
Looks like that one was manually uploaded by Northmetpit, given the wrong licensing tag and missing the Geo-data. Keith D (talk) 12:02, 19 March 2011 (UTC)
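
BarryHunter's rule (skip any archive filename containing "X" or "x") is simple enough to sketch. The function name and the example filenames below are invented for illustration; only the rule itself comes from the comment above.

```python
from pathlib import PurePosixPath


def is_probable_thumbnail(filename):
    """True if the name, ignoring its extension, contains 'x' or 'X'."""
    stem = PurePosixPath(filename).stem
    return "x" in stem.lower()


# Hypothetical archive filenames, for illustration only.
names = ["583392_1a2b3c4d.jpg", "583392_1a2b3c4d_120x120.jpg"]
to_upload = [name for name in names if not is_probable_thumbnail(name)]
```

The check looks at the stem only, so an extension that happened to contain an "x" would not cause a full-size image to be skipped.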

Corrupt files

Please have a look at the following:

--DieBuche (talk) 15:56, 6 March 2011 (UTC)

The old OSGB36 WGS84 problem

While doing a little categorising of files on the Northumberland coast, it is easy to see some whose geotags are 112 m out. (They are in the sea!) Do we have a OSGBerror category we can Hotcat onto the image so it can be cleaned up by a bot later? Opinions? — Preceding unsigned comment added by ClemRutter (talk • contribs) 2011-03-12T01:09:45 (UTC)

Deletion Request Notification moved to sub page

To head off the problem of this page being swamped by DRNs, we have created a subpage, Commons:Batch uploading/Geograph/Deletion requests, tweaked the links and moved existing DRNs.

Disambiguation This page is not for deletion request notifications of files uploaded by the GeographBot.

If you want to view the deletion requests go to: Commons:Batch uploading/Geograph/Deletion requests and watch that page.

--ClemRutter (talk) 10:17, 14 March 2011 (UTC)

Getting the remaining images categorized

Hi everyone. We have an awful lot of uncategorized Geograph images. So what to do next? The first batches of Geograph images were categorized like this:

  1. Get topic category (Geograph added a keyword to each image, we mapped these to Commons categories)
  2. Get location categories (lat+lon-> location tree like Europe, United Kingdom, England, etc etc)
  3. Intersect the topic category with the location categories
  4. Filter over-categorization

This works pretty well, but sometimes the system contains some errors:

  1. Wrong village: Boundaries between villages aren't very clear so we end up in the village next door
  2. Disambiguation problems: More than one place has the same name, image ends up in the wrong category

For the first problem it was suggested to base the location categorization on a different source than OpenStreetMap and geonames. I don't have time in the next couple of months to do this and I'm not under the impression that anyone else wants to do this, so this option is not feasible. At least the image ends up close to the actual location, so this shouldn't be too hard to fix over time. The second problem is very hard to tackle beforehand, but easy to fix when the bot is done. Say the bot puts the images in Category:A, but the images should be in Category:A (other A), you just have to put {{Intersect categories|A (other A)|Images from the Geograph British Isles project}} on Category:A and just wait for the bot to clean up. This only works if all Geograph images in Category:A need to be moved (for example when A is in the USA).

What do you think? Should I just fire up the bot again? With the current manual approach it will take forever to get the images categorized. Even worse is that the topic categories from Geograph are not used. Multichill (talk) 11:57, 7 May 2011 (UTC)
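
For readers who have not seen the bot's code, the four steps listed above can be sketched like this. It is a toy illustration under stated assumptions, not the actual GeographBot implementation: the keyword map, the category tree in `PARENTS` and all function names are made up, and the real bot took its location tree from OpenStreetMap and Geonames lookups.

```python
# Hypothetical data: Geograph keyword -> Commons topic category,
# and a tiny category tree (child -> parents).
TOPIC_MAP = {"Church": "Churches", "Bridge": "Bridges"}
PARENTS = {
    "Churches in Kent": ["Churches in England", "Kent"],
    "Churches in England": ["Churches", "England"],
    "Rochester, Kent": ["Kent"],
    "Kent": ["England"],
    "England": ["United Kingdom"],
}
KNOWN = set(PARENTS) | {p for parents in PARENTS.values() for p in parents}


def ancestors(category):
    """All categories above the given one in the toy tree."""
    seen = set()
    stack = list(PARENTS.get(category, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen


def categorize(keyword, location_tree):
    """location_tree runs from the broadest place to the most specific one."""
    topic = TOPIC_MAP.get(keyword)                               # step 1
    places = [p for p in location_tree if p in KNOWN]            # step 2
    candidates = ([topic] if topic else []) + places
    if topic:                                                    # step 3
        candidates += [f"{topic} in {p}" for p in location_tree
                       if f"{topic} in {p}" in KNOWN]
    covered = set()                                              # step 4
    for candidate in candidates:
        covered |= ancestors(candidate)
    return [c for c in candidates if c not in covered]


tree = ["Europe", "United Kingdom", "England", "Kent", "Rochester, Kent"]
result = categorize("Church", tree)
```

In this toy tree a church photo in Rochester ends up in "Rochester, Kent" and "Churches in Kent"; the broader categories are dropped in step 4 because they are ancestors of those two.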

The topic categories need importing somehow. I am slowly working through the grid squares as this gets the right location in about half of cases immediately. I got many of the big city categories done quickly but it's a struggle for the rural ones. A 1-to-1 mapping of grid square to civil parish exists for many but not all cases.
The proposed fix to the disambiguation problem isn't perfect either - if A is a place in England, and other A is also a place in England it will fail. It would work if some county information can be extracted first. For example, if an image of Luton is in grid TLxxxx, it's Luton, Bedfordshire. If it's in TQxxxx it's Luton, Kent and so on. That ought to give a method for sanity checking.--Nilfanion (talk) 17:54, 7 May 2011 (UTC)
I think we should run the bot and accept some errors. The file contains a template saying that categories need to be checked so users should know that there is a risk that the categories are not 100 % ok. When a new and improved bot is ready in x months it could probably check the categories of all files that still have a "check categories" on. Meanwhile it is possible to fix manually with "Intersect categories" by working on all Geograph images that are in a category that is within the United Kingdom etc. Files that are categorized within the United Kingdom but in a wrong part of the United Kingdom could perhaps be fixed (semi) manually.
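
A rough sketch of the Luton sanity check described above, keyed on the place name plus the grid letters. The lookup table holds only the two examples from the comment, the category names follow its wording, and the function names are invented for this illustration.

```python
# (ambiguous name, 100 km grid letters) -> disambiguated category name.
# Only the two cases mentioned in the discussion; a real table would be larger.
DISAMBIGUATION = {
    ("Luton", "TL"): "Luton, Bedfordshire",
    ("Luton", "TQ"): "Luton, Kent",
}


def grid_letters(gridref):
    """Return the letters of a grid reference such as 'TL0921'."""
    return "".join(ch for ch in gridref if ch.isalpha()).upper()


def resolve_place(name, gridref):
    """Return the disambiguated category, or None if the case is unknown.

    None means the file should be left for manual review rather than guessed.
    """
    return DISAMBIGUATION.get((name, grid_letters(gridref)))
```

For example, `resolve_place("Luton", "TQ7766")` gives "Luton, Kent", while a Luton image with a grid reference starting SX returns None and would be flagged instead of categorised.
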
If there are known problems we could perhaps ask the bot to put the images in a category one step higher. Like "England" or "Ireland". --MGA73 (talk) 18:41, 7 May 2011 (UTC)
The most important point to me is retention of the grid reference (or a similarly precise info such at lat/long to 0.01degree) until we know the location is right, preferably as a category so we can use CatScan, Cat-a-Lot and all the other category based tools to fix things. Once that category info is thrown away it becomes a lot harder to work out the true location. And I'd oppose just resuming things as we did before the pause in December, without some measure in place to stop the "pollution" of blatantly incorrect categories like Category:New York or Category:Moscow.
To expand that second bit, my concern is the files never reaching the correct UK/Ireland category. I don't mind if the files temporarily go into non-UK location cats, but we have to find them. The specific cases before were typically the local maintainer finding Geograph images and complaining here. Other maintainers might just remove the misplaced Geograph imagery from "their" category, without attempting to correct and so we lose the locality info completely. IMO the minimum is some sort of error-detecting algorithm which reports somewhere and we can fix manually. We have a good chance of fixing all misplaced files then. Bot error-correction would be even better, to save the manual legwork, but its detection that's the important bit.--Nilfanion (talk) 19:46, 7 May 2011 (UTC)
User:Nilfanion used IMO above, but his opinion is far from humble. He stops everyone else from progressing because of a very few miscategorisations; his opinion is in the minority but it seems it's the only opinion that counts. There is always going to be some miscategorisation in an undertaking of this scale. If the amount of miscategorisation can be reduced easily then it should be done, but it should not be necessary to have to rewrite a new bot when there is a perfectly good one. I think the only realistic choice is either to run the existing bot, possibly with a few small alterations, or to continue with the manual process that in all likelihood will never actually get completed. I'd choose the former. Oxyman (talk) 02:49, 8 May 2011 (UTC)
I made a query to compare coordinates, you can find the result here. It shouldn't be too hard to build a tool around this to find images with articles where the coordinates are not in the vicinity. This way you can easily hunt down images which ended up in the wrong category. Multichill (talk) 10:37, 8 May 2011 (UTC)
Multiple CatScans can also work, but are very tedious (do them across every country..).
One substantive change I'd suggest is to stop the bot adding city categories to rural imagery - This one had Plymouth incorrectly added because of the OSM query; the Geonames query avoids that.
And please keep that grid square info on the page: With it you can easily use the OS maps to work out the final correct location, something that can't be done with WGS84 lat/long info (the process becomes much more complex). I could process 100 images in 1 grid square in less time than it takes for 5 in a town category.--Nilfanion (talk) 12:59, 8 May 2011 (UTC)
The gridsquare information remains available at the Toolserver (u_multichill_geograph_p) or you can use a copy from http://data.geograph.org.uk/dumps/. Multichill (talk) 13:25, 8 May 2011 (UTC)
Can you just retain it in category format (as it is via {{uncategorized-Geograph}})? It's the simplest option as it means all Commons' existing category-based tools can be used. I'd say the best thing would be to add the category such as Category:Images from the Geograph British Isles project needing categories in grid SX4870 after the bot-generated cats (no point fretting about the name - it's still a temp cat), and if it's hard-coded Cat-a-Lot can grab it to use.--Nilfanion (talk) 15:44, 8 May 2011 (UTC)
Going back to the distance thing. I hacked up this list (based on query and result). It doesn't parse all coordinates yet, but it gives a good impression. This could be turned into a tool to hunt down images with incorrect categories. Multichill (talk) 12:34, 19 July 2011 (UTC)
There's certainly some potential there. One tweak I'd suggest is if the script thinks the coordinates are 0,0 (because it can't parse the coordinates, the en article has no coordinates or whatever) that it gives an error message instead of computing the distance from Africa to the UK. Another thing to try might be to filter the output: If the error is <100km it is likely to be correct; and >1000km is certainly wrong.--Nilfanion (talk) 00:24, 23 July 2011 (UTC)
Yep, the parser doesn't understand everything right now. This is just a proof of concept. This might be the basis for a useful and user-friendly tool. Multichill (talk) 08:30, 5 August 2011 (UTC)
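
The filtering Nilfanion suggests for the distance list (report unparsed 0,0 coordinates as an error, treat under 100 km as probably fine and over 1000 km as wrong) could be sketched as below. This is not Multichill's script: it uses a plain great-circle (haversine) distance on a spherical Earth, and the function names and the "needs review" middle band are additions for this illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points (spherical Earth)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def classify(image_coord, article_coord):
    """Compare image coordinates with those of the article for its category.

    Each coordinate is a (lat, lon) tuple, or None if it could not be parsed.
    """
    for coord in (image_coord, article_coord):
        if coord is None or coord == (0.0, 0.0):
            return "unparsed coordinates"
    distance = haversine_km(*image_coord, *article_coord)
    if distance < 100:
        return "likely correct"
    if distance > 1000:
        return "certainly wrong"
    return "needs review"
```

A Devon image whose category points at an article about New York would come out as "certainly wrong", which is the kind of misplacement complained about earlier on this page.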

Date categories

How difficult would it be to automate the addition of month/year categories e.g. Category:July 2007 in England or Category:2005 in Wales? It seems that most Geograph images come with this information already provided. It's not as high a priority as the more sophisticated categories by location (and even e.g. distinguishing Wales/England and other borders is difficult near the edges, although should be "obvious" for most grid squares) but it is a task that could be undertaken at the same time. TheGrappler (talk) 03:16, 20 June 2011 (UTC)

What's the use of these categories? They will contain thousands of images. How does this help the user? Multichill (talk) 08:37, 20 June 2011 (UTC)
That's what I first thought, skeptically, when I noticed this category schema. But it is a widespread and standardized system used to classify thousands of images, and could be easily implemented for the Geograph uploads, so I think the onus is more on "why not?". We endeavour to make geolocation information both machine-readable and useful for end-users (e.g. by using geolocation templates as well as by-locality categories); it stands to reason we ought to do the same for dates too. This is likely to become more of an issue as Commons matures: for now we are used to most of our photos being from 2005 to 2011, and therefore being "contemporary". But by 2020, we will be looking back at our 2005 images as to some extent "historical" and ought to distinguish them from our photographs of the late 2010s! This is particularly true for photographs of places which change over time (rather than e.g. photos of animal species which are essentially timeless). In fact the Geograph image set contains quite a lot of images from the 1980s and 1990s (and even some from earlier!) which already raise the question of identifying and classifying their dates.
As for "how does it help the user?" ... I've become less skeptical since taking a look at e.g. Category:1983 in New York City or Category:1973 in England. There's obviously potential here: the by-date categories contain a fascinating record of change. In the long run I expect category union and intersection tools will provide the best way to browse by both date and location (e.g. to view "images of Lincolnshire between 1995 and 1999") so the number of files need not be an impediment to helpfulness for end-users. Alternatively editors may choose to split the categories such as "2004 in England" by creating subcategories as date-locality intersections (I'd suggest at a county level to start with; in some cases e.g. Category:2002 in Somerset this work is already underway). But the first step to getting there is to categorize the images at a basic level: judging from the current category scheme, "YYYY in England/Wales/etc" will do for images prior to 2000, and "MONTH YYYY in England/Wales/etc" for images in 2000 and later (so long as month data is available). TheGrappler (talk) 20:15, 20 June 2011 (UTC)
We're hitting the limits of the category system here. See User:Multichill/Next generation categories. The second point (Efficient intersections/searching) would be very nice for this. Multichill (talk) 11:21, 19 July 2011 (UTC)
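
For what it's worth, the naming rule TheGrappler proposes above (year only before 2000, month and year from 2000 onwards when the month is known) is easy to express. The function name and the `month_known` flag are invented for this sketch; working out the country from the location is a separate problem.

```python
import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def date_category(taken, country, month_known=True):
    """Build a by-date category title following the scheme described above."""
    if taken.year >= 2000 and month_known:
        return f"Category:{MONTHS[taken.month - 1]} {taken.year} in {country}"
    return f"Category:{taken.year} in {country}"


example = date_category(datetime.date(2007, 7, 14), "England")
```

The month names are spelled out in a list rather than taken from the system locale, so the category titles stay in English wherever the bot runs.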

Change to {{Check categories-Geograph}}

The check categories template currently refers to Geograph, and to the Geonames and OSM databases. I've mentioned above that these additional tools make significant errors at times, inappropriately assigning rural locations to the nearest city, and assigning a rural picture to the nearest village - when the location is actually part of a different parish.

The tool http://mapit.mysociety.org/ is derived from the OS Boundary-Line datasets and provides a lat/long lookup. The key difference is that the information provided will correctly identify the relevant administrative areas - making it a better match for the category scheme. If it was added to the template (http://mapit.mysociety.org/point/4326/{{{2|}}},{{{1|}}}.html), this would be a more useful database for manual checks than the other two. This makes checking location a lot simpler, as only one click is required to get reliable info.

I could just be bold and add it myself, but I'd want to (a) see if there are objections and (b) bring this to the attention of others trying to categorise this stuff.--Nilfanion (talk) 10:36, 15 July 2011 (UTC)

Great suggestion! I added the link. Someone could also add automated harvesting from this source, but that would probably require setting up a mirror. Multichill (talk) 11:19, 19 July 2011 (UTC)
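
As an illustration of the link format added to the template, here is a small helper that builds the same MaPit URL from a coordinate pair. It assumes that template parameter 1 is the latitude and parameter 2 the longitude, so that the URL takes longitude first; the function name and the example coordinates are made up.

```python
def mapit_point_url(lat, lon):
    """Build the MaPit point lookup URL used in the template link.

    Assumption: the {{{2|}}},{{{1|}}} order in the template means
    longitude first, then latitude (SRID 4326 is WGS84).
    """
    return f"http://mapit.mysociety.org/point/4326/{lon},{lat}.html"


# Hypothetical coordinates somewhere in south Devon.
url = mapit_point_url(50.35, -3.88)
```

Automated harvesting would need more than this (rate limits, parsing the result, probably the mirror mentioned above), so the helper is only useful for manual checks or spot tests.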

End of the batch

I'd like to know what the last file uploaded by GeographBot was. As in, what was the latest Geograph ID? The hard disk was supplied some time ago, and Geograph has had a lot of contributions since then. It's also worth saying that the recent uploads are likely higher value, due to Geograph allowing higher resolution imagery.

Once we know what the last file was, we can work out what to do with the remainder. Noting what the last one is (both here and on Geograph itself), and encouraging manual upload of the good stuff seems sensible at minimum; as well as considering further batches too. I'm inclined not to grab more files until we process the ones we have got though.--Nilfanion (talk) 21:04, 3 August 2011 (UTC)

My plan is to first get the images categorized by the bot so that we don't have a lot of uncategorized files left. I plan to send the hard disk to the Geograph guys again to get all the new files and the higher resolution versions uploaded since I did the first uploads. Don't expect this to happen anytime soon (maybe in a couple of months). Multichill (talk) 08:27, 5 August 2011 (UTC)
Yep, fair enough on that (I agree with your plan, and have no inclination to rushing). What is the last ID on the hard disk? I'd like to have indication of how far we've got and at what point we should upload from Geograph if we want specific imagery.--Nilfanion (talk) 10:17, 5 August 2011 (UTC)
I think it's 1806567. Multichill (talk) 12:00, 5 August 2011 (UTC)

Category errors - August 2011

I've noticed a few errors (caused by the usual lack of disambiguation confusion), for instance Category:Hope. Should I add any I see to Commons:Batch uploading/Geograph/Disambiguation problems? I see no point in making value judgements on whether or not "it should be disambiguated", just listing them is important.--Nilfanion (talk) 10:13, 5 August 2011 (UTC)

Non-standard characters in Geograph files

One major nuisance with Geograph I've noticed is the use of non-standard characters - and how GeographBot mangled them further on upload.

For instance, File:Rose Ash, St Peter’s church - geograph.org.uk - 272396.jpg has an awkward character in the file name and its description - instead of it getting converted to the equivalent ’ or the standard ASCII '. This will affect thousands, probably tens of thousands of files, as a result of this non-standard character. There are also complications caused by the use of ! - look at Geograph files in Category:Westward Ho!.

There are probably several problematic characters like these, but I'm not sure how to trace (and fix them).--Nilfanion (talk) 21:32, 20 October 2011 (UTC)
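
One way to trace them might be to tally every non-ASCII character across a list of file titles and look at the most common ones first. The sketch below assumes the titles are already available as a list of strings (for example from a database dump); the function name is invented and it does not attempt any fixing.

```python
from collections import Counter
import unicodedata


def non_ascii_report(titles):
    """Count non-ASCII characters across titles, most frequent first.

    Returns (code point, Unicode name, count) tuples.
    """
    counts = Counter(ch for title in titles for ch in title if ord(ch) > 127)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
            for ch, n in counts.most_common()]


titles = [
    "Rose Ash, St Peter\u2019s church - geograph.org.uk - 272396.jpg",
    "Plain name - geograph.org.uk - 1.jpg",
]
report = non_ascii_report(titles)
```

Characters like ! are plain ASCII and would not show up here, so those would need a separate search.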