Commons:Batch uploading/Geograph

From Wikimedia Commons, the free media repository

History

In November 2009, it was proposed that a cooperative project be started with Geograph to share their site, which contains about 1.5 million {{cc-by-sa-2.0}} images of the British Isles. The Isles are divided into named 1 km by 1 km grid squares, and the goal of the project is to get at least one photo of every square. At the time, 250,397 grid squares, or 75.5% of all squares, had an image. The system is well known in the British Isles, and the grid squares are prominently displayed on all Ordnance Survey maps. Each square is named using the format AA9999, except in Northern Ireland where the format is A9999.

While the extra 1.5 million images were generally appreciated, the automatic categorisation drew increasingly negative comment, which took the fun out of the project. Different schemes were tried, but in January 2011 the upload was suspended. In February 2011 the upload resumed, this time with only one pseudo-category: a template giving the grid square.

Geograph

In the Village pump, Perry Rimmer brought up the suggestion of copying all these files to Commons. Geograph is a site containing about 1.5 million {{cc-by-sa-2.0}} images of the British Isles. The Isles are divided into 1 km by 1 km squares and the goal of the project is to get at least one photo of every square. 250,397 grid squares, or 75.5% of all squares, currently have an image. Most of the images we use at the English Wikipedia to illustrate villages in the United Kingdom come from this site. The quality of the images is not that high, but nevertheless this is a very rich resource. Dumps of the databases are available, as are torrents containing the files. I will contact the people behind this project to see if we can make some sort of cooperation project of it. Before I actually start uploading images I want to do several things:

  • Build category trees like Category:Towns and villages in England based on enwp and the list at Geograph
    I built the village/town tree for the UK and Ireland.
    Category trees for subjects (like "bridges") still have to be built
  • Populate these trees with the current images
  • Clean up the current uploads
  • Wait for more disk space to arrive

Imported the database dump at the toolserver. It should be straightforward to extract all information from the database. Categorization, on the other hand, is probably going to be a nice challenge. Found several possible tools

- Multichill (talk) 16:44, 19 November 2009 (UTC) (more to come)

Update December 2009

I downloaded the first 250,000 images (about 25 GB) and the database dump. With these two combined it's quite easy to generate descriptions and filenames. I modeled this after geograph_org2commons, as everybody seems to be happy with that. Categorization, on the other hand, is hard. I take the following approach:

  • Get locations from http://ws.geonames.org/extendedFindNearby
    • I mapped some IDs (earth/Europe/countries/counties/etc.) to a Commons category in a database. I first take a look in this database
    • If the ID is not in the database I'll take a look if I can find a category at Commons with a similar name
    • I now have a bunch of location categories at Commons
  • Get the topic from Geograph
    • I mapped Geograph categories (=imageclasses) to Commons categories. See if the Geograph category is in the database
    • If the Geograph category is not in the database look if you can find a category at Commons with a similar name
  • Combine location and topic: Try to find categories deeper in the tree. So for example not Category:Churches, but Category:Churches in England
  • Filter the categories
    • Follow redirects
    • Remove disambiguation categories
    • Filter out overcategorization
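The last two steps of the approach above (combine location and topic, then filter) can be sketched roughly as follows. This is only an illustration and not the bot's actual source: the function names and the in-memory `existing`, `redirects`, `disambiguation` and `parents` tables are assumptions standing in for lookups against Commons.

```python
def combine(locations, topics, existing):
    """Add 'Topic in Location' categories that exist, keeping the originals for now."""
    combined = set(locations) | set(topics)
    for topic in topics:
        for location in locations:
            candidate = f"{topic} in {location}"
            if candidate in existing:
                combined.add(candidate)
    return combined


def follow_redirects(cats, redirects):
    """Replace redirected categories by their targets (guarding against loops)."""
    resolved = set()
    for cat in cats:
        seen = set()
        while cat in redirects and cat not in seen:
            seen.add(cat)
            cat = redirects[cat]
        resolved.add(cat)
    return resolved


def ancestors(cat, parents):
    """All categories reachable by walking up the (hypothetical) parent table."""
    found, stack = set(), list(parents.get(cat, ()))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(parents.get(parent, ()))
    return found


def filter_categories(cats, redirects, disambiguation, parents):
    """Follow redirects, drop disambiguation categories, drop overcategorization."""
    cats = follow_redirects(cats, redirects) - set(disambiguation)
    redundant = set()
    for cat in cats:
        # A category is redundant if it is an ancestor of another chosen category.
        redundant |= ancestors(cat, parents) & cats
    return cats - redundant


def categorize(locations, topics, existing, redirects, disambiguation, parents):
    return filter_categories(combine(locations, topics, existing),
                             redirects, disambiguation, parents)
```

Note that the overcategorization filter only works as well as the parent table: if "Churches in England" is not linked to "Churches", both will survive, which is the broken-tree issue listed under the expected problems.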

This seems to work alright, but for now I'll add {{Check categories-Geograph}} to the images to be sure. Some issues I expect to encounter:

  • Disambiguation problems. Some images will end up in strange categories because a lot of names aren't properly disambiguated
  • Not properly filtering out overcategorization because the tree is broken. For example we have a lot of Topic in Europe categories, but Topic in the United Kingdom is not a subcategory
  • Some categories will get crowded because the tree for the United Kingdom and Ireland hasn't been built yet

The source is available (work in progress!). Oh, and btw, the usual tricks apply so filenames get cleaned up and no duplicates are uploaded. Multichill (talk) 18:06, 3 December 2009 (UTC)

Did a test upload of 365 files. Feedback is appreciated. Multichill (talk) 23:24, 3 December 2009 (UTC)

This is good news, I think it will bring many interesting images. I looked at the ones here. Just a few points for now:
  1. "in Europe": #398 is in both "Buildings in Europe" and "Buildings in Dorset" (and also "Buildings in England"), possibly because Category:Buildings in Dorset is not in "Buildings in England". #256 is in both "Churches in Europe" and "Churches in Hampshire", despite the latter being a subcategory of the former.
    Yes, a lot of trees are incomplete. These will show up. Already fixed a couple. Please fix the tree if you spot problems like this.
    Will do.
  2. Dot: When the image title doesn't include a "." at the end, one needs to be added when combined for the description (samples: #256, #106). If there is one in the title, it doesn't necessarily need to appear in the file name (#38).
    This is something so minor, I'll just keep it this way
    A minor fix, but could you look at #106? "A view from East Cliff Beach across to Charmouth and Stonebarrow Hill This image is taken from the last concrete groyne" just looks odd.
    Oh right, now I understand you. Thought you were talking about the title, but you're talking about the description. Will fix. (note: strip(), if not last char . -> add char). Multichill (talk) 17:50, 4 December 2009 (UTC)
  3. Filename: I'd limit the "geograph.org.uk - 38" to something like "GG0000038". I don't see much benefit to use the domain name there. I was going to suggest to add the date to the file name, but for some of the files, this seems to be "unknown". 00000000 could be an option for these. Combined with the number this could be "GG20091204-2113138" for a file of today.
    I modeled the filename after Magnus' tool. I like it because it prevents name collisions and is easy to understand
    The full domain name seems excessive and I think the date should be in there. For the Navy pictures, this allows easy sorting.
  4. #256 is in category:Ibsley instead of Harbridge, but I figured out why.
    Yup, the location tool is not right all the time.
  5. #106 is in Category:Coasts and #36 in Category:Hills. Such general categories might fill up quite quickly.
    For 106 this happened because the location tool didn't return a suitable location (Coasts in the English Channel doesn't sound very suitable to me)
    For 36 an intersection between Category:Hills and Category:Isle of Man should be made. Creating these kinds of categories will prevent the main categories from filling up
  6. Template: The layout of {{geograph}} could need some work, but this isn't really related to your upload. I already made a request to remove the interwiki from the template (#iw)
    I would like to have a similar layout as {{Fotothek-License}} or {{KIT-license}}, but let's discuss that at Template_talk:Geograph
  7. Stray text: Some images still have the "Importing image file" text (e.g. #278)
    Yeah, noticed that too. Something went wrong with the import. Removed it from most of the files.
  8. Headers: As the headers are optional, I think we should drop at least {{int:filedesc}} .
    I like to add them for the non-English speakers
    "Description" is translated too, so "Summary" isn't really needed.
  9. The images seem to be fairly old, maybe it's worth doing a test with more recent ones.
    Despite this list, I think the overall quality of the import is good. It is likely to give quite a lot of categorization to do. -- User:Docu at 06:16, 4 December 2009 (UTC)
    That's right. I started with the oldest images and work my way to the newer images. I'll add some more manual categorization mappings. If trees are built and corrected for the most used categories (like Churches) before I compile my batches it will save a lot of time. I'll make a list of important categories to work on. Multichill (talk) 10:45, 4 December 2009 (UTC)
    I replied above. -- User:Docu at 17:00, 4 December 2009 (UTC)
    At User:Multichill/Geograph/categories I put a list of categories. This is based on the 1.5M files in the database. This covers about 60% of the files. I'm working on raising this to at least 80% (will update the list accordingly). For all these categories the trees should be checked and built. The layers to check:
    Sometimes a topic in Europe category exists (for example Category:Churches in Europe). The country categories should be made a subcategory of this. Generally all the topic by location categories should have two or more parent categories. If not, there's probably something missing. Multichill (talk) 13:27, 4 December 2009 (UTC)
    Given the sheer amount of images, it might be worth making county categories for topics that otherwise might not be categorized that way. This until we have subcategories for specific features or structures. Your bot, is it already set to make them? BTW could you add "heading:?" to the coordinates? -- User:Docu at 17:00, 4 December 2009 (UTC)
    Most of the categories in my list are already divided by country. It's more about the lower layers. I don't have a bot to create these categories automagically.
    Heading is added when it's known, see for example File:Aldershot - Home of the British Army - geograph.org.uk - 177.jpg. Multichill (talk) 17:50, 4 December 2009 (UTC)
    I made a matrix of categories here Commons:Batch uploading/Geograph/cat-matrix. So all red categories should be created? If they have the right in/of :-) --MGA73 (talk) 19:07, 4 December 2009 (UTC)
    Yes, for many probably also the subcategories for counties. Keep in mind that the final count could easily be 2-4 times the quantity listed. -- User:Docu at 05:46, 5 December 2009 (UTC)
    All done now. Maybe Multichill can get a bot to push existing images down in the new tree? --MGA73 (talk) 21:56, 7 December 2009 (UTC)
    Or maybe someone else can write it ;-) Maybe something in combination with {{Populate category}}. If the image is in both parent categories, move it to the underlying category. Have to think about that. What to do if it has more than 2 parent categories? Etc. Multichill (talk) 17:16, 8 December 2009 (UTC)
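The "push down" idea at the end of this thread (if an image is in both parent categories, move it to the underlying category) could look roughly like the sketch below. The function name and arguments are hypothetical, and the open question of more than two parents is handled here in the simplest way, by requiring the file to be in all listed parents.

```python
def push_down(file_categories, intersection, parent_categories):
    """Replace the parent categories by their intersection category.

    Only acts when the file is in every listed parent; otherwise the
    file's categories are returned unchanged.
    """
    file_categories = set(file_categories)
    parent_categories = set(parent_categories)
    if parent_categories and parent_categories <= file_categories:
        return (file_categories - parent_categories) | {intersection}
    return file_categories
```

For example, a file in both "Hills" and "Isle of Man" would end up only in "Hills in the Isle of Man", while a file that is in "Hills" alone is left as it is.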

<unindent>To make sure they get categorized, I made five empty categories for Llyns (Special:PrefixIndex/Category:Llyns). Would these work for your bot that way? If I redirect them to corresponding lake categories, would that work too? -- User:Docu at 10:55, 6 December 2009 (UTC)

If a en:Llyn is just another word for lake, should we not just have Multichill tell the bot that Llyns = Lakes? --MGA73 (talk) 18:09, 7 December 2009 (UTC)
My second question aims at that. Otherwise, I can merge them later. Obviously, I prefer to see them categorized as Llyns rather than not at all. -- User:Docu at 19:37, 7 December 2009 (UTC)
This will work, but it's probably easier to add a database entry so that my bot knows that Llyn means category:Lakes. I already did this for the top categories. Should cover about 80% of the images. Feel like helping to increase this hitrate? Multichill (talk) 17:16, 8 December 2009 (UTC)
If someone wants to help they can work on User:MGA73/Sandbox Commons:Batch uploading/Geograph/Sandbox. --MGA73 (talk) 17:33, 9 December 2009 (UTC)
  • It looks like people start requesting renames for files named similar to the ones used by the bot: #1040807. -- User:Docu at 11:55, 12 December 2009 (UTC)
Yes but look at the reason. If filename is wrong then we can rename. --MGA73 (talk) 14:39, 12 December 2009 (UTC)
I don't disagree on part of the request, but the requestor also added "it contains inappropriate information about the source" and removed " - geograph.org.uk - 1040807". -- User:Docu at 14:53, 12 December 2009 (UTC)

Categorization

  1. Match location id's to Commons categories. Almost done.
  2. Match geograph topic categories with Commons categories. Working on it, see User:MGA73/Sandbox Commons:Batch uploading/Geograph/Sandbox.
  3. Create topic by location categories. Working on it, see here for a list and here for a matrix
  4. Some geograph to Commons category matches turned out to be somewhat strange. Check and correct this list. List here

Multichill (talk) 23:52, 12 December 2009 (UTC)

For (4): in the list, there are a few matches I don't understand: why does "sea loch" match "lake" rather than "sea lochs"? "Loch" should match "lochs", not "Bodies of water".
Currently there is "Village_sign -> Category:Signs": Is there a way to create just "Category:Village signs in the United Kingdom", etc. to avoid that they go into too general categories?
To avoid problems of the "Churches in Europe" type above, maybe the matching should either work around missing continent links or we should try to run a bot to fix the categories before. I tried using CatScan2 to find such categories, but it seems to time out.
BTW I added a redirect for Bogs. It was missing despite there being "Category:Bogs by country" (we would still need to make UK/IE specific categories). Please check if this works. -- User:Docu at 03:55, 13 December 2009 (UTC)
Category:Sea lochs was missing: fixed that. -- User:Docu at 15:09, 13 December 2009 (UTC)
Matches are work in progress. A lot of them have been changed.
I don't want the categories to be too specific either so we have to find a balance.
Looks like I tackled all the Europe categories. If I missed some it's easy to fix (create link, use bot to filter the category). Multichill (talk) 16:32, 13 December 2009 (UTC)
I moved my sandbox to Commons:Batch uploading/Geograph/Sandbox. Better we work there so my own "testing" does not ruin something. --MGA73 (talk) 08:12, 14 December 2009 (UTC)
For Europe/UK, you got almost all of them: I fixed 11 missing ones: list. -- User:Docu at 15:23, 16 December 2009 (UTC)

If someone thinks we should have more categories please leave a note Commons:Batch_uploading/Geograph/cat-matrix#Sub-matrix_for_counties. --MGA73 (talk) 09:48, 17 December 2009 (UTC)

Please note that Category:Trees is a Main category and should not have files added directly to it. Please only use subcategories of it! As it is, I've just been saddled with 95 files which I'll have to recategorise now :-(( Thanks - MPF (talk) 01:59, 31 January 2010 (UTC)
See #Comments on ongoing upload. Multichill (talk) 09:23, 31 January 2010 (UTC)

Progress December 2009

This table is to keep track of the progress of the upload. All directories are located in /mnt/user-store/geograph/torrents at the toolserver.

Source dir | Destination dir | Prepared | Imported
geograph_vol001_image_0_to_49999/00 | geograph_vol001_image_0_to_49999_prepared/00 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/24 | geograph_vol005_image_200000_to_249999_prepared/24 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done

Uploading

I've been told, that new space should be ready this week :-) --MGA73 (talk) 13:00, 25 January 2010 (UTC)

Yeah, new disk space is there. Batches are ready to be imported. Multichill (talk) 19:29, 27 January 2010 (UTC)
Assigned to | Progress | Bot name | Category
Multichill | Batches prepared ready to import | GeographBot | Category:Images from the Geograph British Isles project

Do you have an update on which new batches have been uploaded? This is more exciting than Christmas. --ClemRutter (talk) 18:52, 22 February 2010 (UTC)

No, not really, images 1 to 250,000 have been uploaded and now we're in the process of getting these images properly categorized. Multichill (talk) 18:44, 24 February 2010 (UTC)


Imprecise Geotags

And several others all tagged to the same incorrect location. There are several other bot mistagged geograph images in Southampton. I do not know how widespread this is, but it needs to be addressed. 188.222.170.156 17:50, 2 May 2010 (UTC)

The locations are probably just lacking precision. Geograph works by a number of squares on a grid. You can fix this by adjusting the coordinates. -- User:Docu at 18:04, 2 May 2010 (UTC)
Yes and no. Looking at the cigarette factory, the bot has captured the user coordinate, which is measured in OSGB36, then displayed it on the map, which uses WGS84. No Helmert conversion has been done, so we expect a 112m inaccuracy. Why the others are there is open to question.--ClemRutter (talk) 09:23, 3 May 2010 (UTC)
It would be interesting to know what's in the input used for the upload. -- User:Docu at 10:25, 3 May 2010 (UTC)
See top: these dumps of the database. Multichill (talk) 11:33, 3 May 2010 (UTC)
There are several other incorrectly marked locations in Southampton, some like the above, are all marked to one incorrect location. Here's an example of one on its own though, File:Redbridge Flyover, Southampton - geograph.org.uk - 28728.jpg, bot copied from here. The source states a location of 50°55.2873N 1°28.0549W (translates to DMS 50:55:17.238N 1:28:3.294W), yet the Commons upload translates this to 50:55:11.09N 1:28:4.39W. There needs to be a manual sweep of geograph bot uploads, because right now, it's poisoning the tools that use this information such as map layers. Suitcivil (talk) 21:09, 4 May 2010 (UTC)
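The conversion quoted in the comment above (decimal minutes to degrees:minutes:seconds) can be checked with a short sketch like this one. The helper names are made up for illustration; the input values are the ones quoted for image 28728.

```python
def decimal_minutes_to_dms(degrees, minutes):
    """Split decimal minutes into whole minutes and seconds (rounded to 3 places)."""
    whole_minutes = int(minutes)
    seconds = round((minutes - whole_minutes) * 60, 3)
    return degrees, whole_minutes, seconds


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert D:M:S plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


source_lat = decimal_minutes_to_dms(50, 55.2873)  # value shown on the Geograph page
source_lon = decimal_minutes_to_dms(1, 28.0549)

# Latitude gap between the source value and the value on Commons, in arc-seconds.
gap = (dms_to_decimal(*source_lat, "N") - dms_to_decimal(50, 55, 11.09, "N")) * 3600
```

This reproduces 50:55:17.238N and 1:28:3.294W for the source values. The latitude gap to the Commons value comes out at a little over 6 arc-seconds, which is on the order of 190 m on the ground (one arc-second of latitude is about 31 m).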
Just spotted that something very similar is mentioned above at #Geotags. Suitcivil (talk) 21:16, 4 May 2010 (UTC)
At least in the case of 28728 (Redbridge Flyover) I believe that is our (Geograph's) fault. Multichill's GeographBot just copies that lat/long from the dump files mentioned above. Inspired by this thread I went digging and it seems that the Geograph code doesn't update the said lat/long when an image is relocated within a gridsquare. So the lat/long for the said image is the old location before it was updated via the Geograph site. We've never noticed because it's just cached there for the search engine; the actual photo page calculates it live from the easting/northing. The best I can offer to do is create a bot to check the lat/long columns (as used in the dump) are correct, and update them as necessary. And create a new dump. Will also have the script output a changelog, so it can be used by another bot to correct the coordinates within Wikimedia. I have no way of estimating how many images will be affected by this yet. (Oh and of course will fix the bug within Geograph!) Sorry all for the confusion! BarryHunter (talk) 23:30, 4 May 2010 (UTC)
I just checked the four images mentioned in the opening to this section, and yes all have been moved within the gridsquare, so will have inaccurate lat/long in the version uploaded by GeographBot BarryHunter (talk) 23:34, 4 May 2010 (UTC)
Look, we have a million and a quarter excellent images, and potentially a million and a quarter inaccurate geotags. The errors are multiple: File:Church Street, Shirley, Southampton - geograph.org.uk - 26621.jpg back converts to SU 39500 13500 giving 100m accuracy (or a hell of a coincidence) but going back to geograph we see that is the subject location. In this case it is also noted as the photographer's location, but in the previous one I checked the photographer was at a different location! Here, from the description, the photographer was at //location dec|50.9223|-1.4333// or SU 39929 13802, so again different.
Could I propose that we upload all the images as soon as possible but with a new template {{geoglocation}}: this would include all the geo information from geograph, and all the parameters of {{location}}. A bot could then be written that carefully verifies the input fields and, when satisfied, copies the correct photographer information, accurately converted from OSGB36 alphanumeric into WGS84 {{location}}, or adds a textual comment that the object location was XXXX XXXX and the degree of precision. This would also allow these images to be put in a to be checked category. We can't do this at the moment by bot or manually without referring back to the Geograph page, as not all the relevant fields have been copied across.
I do propose a two part solution, as so many images are erroneous that one no longer has confidence in any one of them being correct, and the UK Map is being saturated with errors. Once we have got all the data on hand we can discuss the best algorithm to use in order to correctly geotag them. Initially I would be happy if we could generate our own photographer's lat/long by running the Helmert transformation on the photographer's location grid reference. I have the code here in js [1].
Just as a summary, here are the errors I have found
  • Sloppy tagging. The Geograph user just typed in the wrong reference.
  • Geograph works to 100m precision- we attempt at least 7m precision
  • Geograph tags the object not the photographer
  • Geograph gives the object location in OSGB based alphanumeric grid references and WGS84: we use WGS84
  • Geograph gives the photographer location in OSGB based alphanumeric grid references only: we use WGS84
  • Geograph fills an empty photographer location field by copying in the object location
The good news is that Geograph and I get the same result when we run a Helmert conversion (OSGB36 to WGS84)
The cumulative effect of all these errors is impossible to quantify, but runs from 112m ± 100m through to several kilometres.--ClemRutter (talk) 14:31, 5 May 2010 (UTC)
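To make the precision point concrete: an alphanumeric grid reference such as "SU 39500 13500" encodes an easting and northing on the OSGB36 National Grid, and the number of digits fixes the precision (six digits give 100 m, ten digits give 1 m). A minimal sketch of the decoding for two-letter (Great Britain) references follows. The function name is an assumption, and it only yields OSGB36 eastings/northings; the Helmert step to WGS84 discussed above is not shown.

```python
def parse_grid_reference(ref):
    """Decode a two-letter OS grid reference into (easting, northing, precision_m).

    Coordinates are for the south-west corner of the referenced square.
    No check is made that the square actually lies within Great Britain.
    """
    ref = ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not letters.isalpha() or "I" in letters:
        raise ValueError(f"bad grid letters in {ref!r}")
    if (digits and not digits.isdigit()) or len(digits) % 2 or len(digits) > 10:
        raise ValueError(f"bad digits in {ref!r}")

    # Letters index a 5x5 grid, A..Z without I.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1
    east_100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    north_100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    half = len(digits) // 2
    precision = 10 ** (5 - half)  # metres represented by the last digit
    easting = east_100km * 100_000 + int(digits[:half] or 0) * precision
    northing = north_100km * 100_000 + int(digits[half:] or 0) * precision
    return easting, northing, precision
```

With this, "SU395135" and "SU 39500 13500" decode to the same point, but the first carries a stated precision of 100 m and the second of 1 m, which is why a round ten-figure reference looks suspiciously like a padded six-figure one.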
The gridimage_geo database table, from the geograph dumps (already loaded by Multichill onto the Toolserver), contains the full eastings/northings for photographer location as accurate as geograph has it, many are 10m precision. This is more reliable than the wgs84_lat/long columns already used by GeographBot - due to the bug mentioned above. BarryHunter (talk) 17:08, 5 May 2010 (UTC)
I've corrected the column on geograph, and will recreate the actual dump files shortly. However, I have put the changelog here: http://data.geograph.org.uk/tmp_fix_log.mysql.gz - which someone can use to create a bot to correct the coordinates. I've put in the old coords too, so it can only replace them if somebody hasn't already updated them on wikimedia. I'd do it myself, but having never created such a bot, would probably just wreak more havoc! 776 images below id 250000 were corrected, so that's a ballpark figure for the number of images affected on wikimedia. (and yes this is still our subject location) BarryHunter (talk) 22:55, 5 May 2010 (UTC)
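The safeguard described here (only replace a coordinate if it still holds the old, known-bad value) might be sketched like this. The function and the tolerance are assumptions for illustration; the actual layout of the changelog file is not described in this thread, and the coordinates in the example are made up.

```python
def corrected_coordinates(current, old, new, tolerance=1e-5):
    """Return `new` only if `current` still matches the known-bad `old` pair.

    If somebody has already edited the coordinates on the file page,
    their value is kept untouched.
    """
    still_old = all(abs(c - o) <= tolerance for c, o in zip(current, old))
    return new if still_old else current


# Hypothetical values, not taken from the changelog.
untouched = corrected_coordinates((50.9197, -1.4679), (50.9197, -1.4679), (50.9215, -1.4676))
hand_fixed = corrected_coordinates((50.9214, -1.4677), (50.9197, -1.4679), (50.9215, -1.4676))
```

Comparing with a small tolerance rather than exact equality allows for rounding differences between the dump and the value stored in the template.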
One comment which is related here. Regarding subject / photo locations: We (Commons) care about the photo location for geolocation purposes, but the subject location for effective categorisation... (So if different, don't throw out the subject location data).--Nilfanion (talk) 23:03, 5 May 2010 (UTC)
I'll put the temp table on the toolserver too and use it to correct the incorrect geotags here. Don't know when. Multichill (talk) 13:19, 22 May 2010 (UTC)
I have just discovered that commons has a {{object location}} tag- that may be useful. I really don't know any more, and I have never seen it in use. --ClemRutter (talk) 22:12, 22 May 2010 (UTC)
Ok, fixing geotags now. Multichill (talk) 15:27, 14 August 2010 (UTC)

Update May 2010

Hi guys, it's time for an update!

  • I spent some time on categorization. I'm now using the OSM source and the results seem to be much better. I'm thinking about combining both sources, but I'm not sure about that yet (Gwynedd problem would happen again).
  • No news on new batches. No new torrents available yet and no word from the Geograph guys yet. When we're going to upload new batches, the categorization for these new batches should be much better because the category tree is much more extensive now.

Thanks everyone for helping out! Multichill (talk) 11:11, 23 May 2010 (UTC)

I'd be curious to know how many {{Check categories-Geograph}} are still in place.

A few problems I've noticed with the categorisation, which you may want to think about before the next run:

  1. Category:Calstock (in Cornwall) and Category:Bere Ferrers (in Devon) are adjacent and the border between them is very complex. Almost all the files I've seen for that area have identified the location as Calstock, and the county (for subject cats) as Devon. It would be nice if it could get it right some of the time...
  2. The current re-categorisation is (inappropriately) adding city categories for files well outside city boundaries (Example - >5 km from the city boundary).
One thing that would definitely be worth exploiting is the Ordnance Survey OpenData; in particular the Boundary-Line product. When the geolocation of the file is right, correct interpretation of that data would identify the most precise administrative region (typically the parish in England - so village level), which in turn would correctly categorise it (most of the time). If it was only used to county level depth, it would guarantee the Gwynedd issue doesn't recur.--Nilfanion (talk) 11:45, 23 May 2010 (UTC)
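Looking up the administrative region from boundary data, as suggested above, comes down to a point-in-polygon test. The sketch below uses the standard ray-casting method on toy squares; it says nothing about the actual Boundary-Line file format, and the region names and coordinates are purely hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going east from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(x, y, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Two adjacent toy "parishes" sharing the border x = 10.
regions = {
    "Parish A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Parish B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
```

Unlike a "nearest village" lookup, this assigns a point close to the border to whichever side it actually falls on, which is the Calstock / Bere Ferrers kind of case. It does of course depend on the geotag itself being right.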
I've been chatting away at the HotCat pages in an attempt to speed up the cat checking process, by making HotCat more template friendly, with little success though. --ClemRutter (talk) 15:28, 23 May 2010 (UTC)
208911 files are tagged with {{Check categories-Geograph}}. I'm thinking about not adding it to new uploads, what do you guys think?
As for the incorrect categories: My bot is as good as its sources. If the Ordnance Survey OpenData is freely available and of better quality it's surely worth looking into.
But it's also possible to increase the quality of automatic categorization with the current tools:
  • I have a list of static mappings from geo id's to Commons categories. The full list of id's for the United Kingdom can be found here. I now changed the mapping of 2647716 (Gwynedd) from Category:Gwynedd to Category:Wales to prevent the flooding of the Gwynedd category. I will reread this page for more problematic locations (I remember something with Ohio) and will add them to the static mappings. Of course more static mappings can be added to improve categorization, it's probably useful to have all counties of the UK and Ireland mapped.
  • I have a list of imageclass to category mappings. It's probably worth checking the list of unmapped topics and map some more, but not a lot to gain here.
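The static mapping for locations described in the list above can be pictured as a lookup table that is consulted first, with a name match as fallback. This is a sketch only: the table and function names are assumptions, and the single entry shown is the Gwynedd override mentioned above.

```python
# Geo id -> Commons category; consulted before any name-based matching.
STATIC_LOCATIONS = {
    2647716: "Category:Wales",  # Gwynedd, mapped up to Wales to avoid flooding Category:Gwynedd
}


def location_category(geo_id, name, existing_categories, static=STATIC_LOCATIONS):
    """Pick a Commons category for a location id, preferring the static mapping."""
    if geo_id in static:
        return static[geo_id]
    candidate = f"Category:{name}"
    return candidate if candidate in existing_categories else None
```

Because the static table wins over the name match, a problematic location can be redirected to a safer category without touching the rest of the logic.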
Multichill (talk) 10:15, 30 May 2010 (UTC)

Improved category intersection

Hi guys, I updated the bot which works on the categories tagged with {{Intersect categories}}, see Template talk:Intersect categories#Subcategories!. This should really improve our ability to split out crowded categories. Multichill (talk) 12:11, 30 May 2010 (UTC)


The big one

Earlier this week I got a hard disk containing all the 1.8 million Geograph images + a recent dump of the Geograph database. I'm transferring the files to the toolserver now, I updated my local copy of the Geograph database and changed the bot a bit to reflect some changes. I'm about ready to generate new batches to be imported. Multichill (talk) 17:09, 14 August 2010 (UTC)

As the old military expression goes: "Incoming!". How many batches are you planning and will there be intervals between them? Good luck and tnx again for all your hard work on this project - these images have really transformed our coverage of Wales and Britain at Welsh Wikipedia and I'm sure the same is or will be true with the other Wikipedia editions. Anatiomaros (talk) 18:14, 14 August 2010 (UTC)
I'm starting at image 250,000 and will be working in batches of 10,000 files. These batches are compiled and tarred up so one of the shell users can download them for import. I'll probably just keep one program running depending on the speed. I hope to be doing around one batch a day, but I'm not really sure.
For locations I'm combining the results of the two sources (Geonames and OSM). These locations + topics + a lot of filtering gives the final categorization results. A lot of categorization issues have been solved so at least we shouldn't be running into these. If new categorization issues arise we'll just have to tackle them.
Commons:Batch uploading/Geograph/cats to clean up and Commons:Batch uploading/Geograph/top categories will be updated every once in a while to hunt down overflowing categories. Multichill (talk) 18:30, 14 August 2010 (UTC)
One problem with the top category list is the number of X in county categories high on that list. For example, Category:Fields in Devon does not have meaningful subcategories (all there is is one geographic subset), so with the present structure should be incredibly bloated (and needs more precise subject categorisation).
It would be helpful if the lists were split into those that need more precise location categories (such as "X in England") from those that need more precise subject categorisation (such as Category:Cumbria or "X in County").--Nilfanion (talk) 18:56, 14 August 2010 (UTC)
Any news on when "The big one" is happening? 86.169.41.141 10:49, 18 October 2010 (UTC)
Commons:Wiki Loves Monuments took a lot of time. I'm currently generating new batches. These will be slowly uploaded in the next couple of weeks. Multichill (talk) 11:00, 18 October 2010 (UTC)
I'm keeping track at Commons:Batch uploading/Geograph/Progress. Multichill (talk) 17:30, 21 October 2010 (UTC)


General feedback

Sorting Geograph images is a three-stage process really:

  1. Initial categorisation at/before upload
  2. Recategorisation by bot
  3. "Final" tweaking by human.

I have been doing extensive work on that "final" stage and can draw some conclusions from what I've seen:

  • In general, the subject categorisation (houses/fields/trees in X) is fine, though additional subject categories may need adding.
  • The location categories for images of cities, towns and villages are usually fine - though city cats are inappropriately added to the surrounding rural areas.
  • However, the error rate for categories of rural imagery is much higher.
  • The issue raised above with mis-categorised files is not that common in terms of raw numbers - the problem is the severity (if it is not localised to the correct neighbourhood it's much harder for users with "local" knowledge to sort).

Files can have an incorrect location cat in three ways: non-location categories being treated as a location such as Category:Treen, the Melbourne situation, and clearly wrong locations within the UK (Two examples: 1 - the village is several km from the photo and there are other villages between them, and Category:Corntown - a Welsh village with English content in the cat).

Incidentally when the file is miscategorised, I think the bot is already picking it up. In general, even if the location identified is "wrong" it still adds the correct x in county image - the example I just gave correctly placed it in Category:Moorlands in Devon, even though it applied a Somerset village category. If it merely identified this conflict and just added the Moorlands in Devon cat it would be correctly (if inadequately) categorised.
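The conflict check suggested in the paragraph above could be sketched as follows: if the county of the identified village disagrees with the county in the "X in county" category, keep only the county-level category. The function name and the village-to-county table are hypothetical, and the sketch assumes the county can be read off the category name after the last " in ".

```python
def resolve_location(village_category, county_topic_category, county_of_village):
    """Keep the village category only if its county agrees with the topic category."""
    county = county_topic_category.rsplit(" in ", 1)[-1]
    categories = [county_topic_category]
    if county_of_village.get(village_category) == county:
        categories.insert(0, village_category)
    return categories
```

A Somerset village paired with "Category:Moorlands in Devon" would then produce just the Moorlands category, leaving the file correctly (if inadequately) categorised for a human to refine.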

The error rate for rural imagery is significant, and is a natural result of the "nearest village" algorithm. In my last 100 edits in file space (almost all Geograph checks) ~20% corrected the location. If you bear in mind this is really a problem for rural stuff - if this sample of files is representative then the error rate on rural images may well be closer to 50%. These figures are high enough that I cannot trust the bot categorisation and have to verify manually.

Oh and what I mean by error in this context: The bot identified location is not in the correct civil parish. For example, File:Haws near Butland Wood - geograph.org.uk - 274186.jpg is a view of Modbury parish (as can be verified from OS mapping) but was categorised on upload as being in the adjacent location of Kingston. It's correct to say the image is of Modbury; it is not correct to say it's of Kingston.--Nilfanion (talk) 10:42, 23 December 2010 (UTC)


Indefinitely on hold[edit]

Over the last year I spent a lot of time on this project. In this project we got a lot of images, it was a fun thing to do and I got a lot of positive feedback. Over the last couple of months this changed. I got a lot of negative feedback from a small group of people and barely any support from the people who like this project. With this topic we hit rock bottom; this project is no fun for me anymore, so I'm going to waste my time on something else. I hope you're all happy with that. Multichill (talk) 12:30, 24 January 2011 (UTC)

  • Damn... --MGA73 (talk) 19:42, 24 January 2011 (UTC)
...and double damn. Please reconsider, Multichill. I can well understand how you must feel but this is quite possibly one of the most important projects we have ever seen on Commons. I have used many GeographBot images on Welsh Wikipedia and have even created articles I'd probably not have started had I not come across an image here whilst categorising these files. Your work is transforming the Britain and Ireland geo-cats and is greatly appreciated by many of us, here and on the various wikipedia editions. Some problems with categories are just inevitable in a project of this size - and I've dealt with my share of them - but responses like "block the bot!" are not warranted. Between that and the fact that some people seem to be here on a mission to have as many images as possible deleted on the slightest of technicalities, I sometimes wonder what is becoming of Commons these days. Anatiomaros (talk) 20:02, 24 January 2011 (UTC)

Suggestion to resume upload[edit]

I would really like to see the rest of the files uploaded. In my opinion a few wrong categories are acceptable, but sadly not all users share this view.

So I suggest we upload all the files but do not categorize them with a bot. Instead they are placed in a category like Category:Uncategorized files from Geograph. Once the files are uploaded, users can work on the files manually, or perhaps someone can design a bot that can categorize some of the files (once they are on Commons EVERYONE can work on them).

So if you support this idea please add your name below and hopefully we can get Multichill to upload the files.

Discussion moved to Commons_talk:Batch_uploading/Geograph#Suggestion_to_resume_upload.

Categories[edit]

All the images need to be checked to see that they are categorised. Currently there are two types of Geograph images.

Early images[edit]

These are usually correctly tagged. They can be identified by the template

{{Check categories-Geograph|year=2010|month=May|day=30|lat=53.422961|lon=-1.908169|Geographcategory=Moorland}}


Later images[edit]

These have no defined categories. A location and a subject need to be added. These can be identified by the template {{Uncategorized-Geograph|gridref=SX6960|year=2011|month=February|day=21}}
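As a rough illustration of telling the two types apart, the sketch below looks for either template in a file page's wikitext and returns its parameters. The two template names and the sample parameters are taken from this page; the function name and the "early"/"later" labels are hypothetical, and the regular expression is a simplification that assumes the template contains no nested templates.

```python
import re

# Matches {{Check categories-Geograph|...}} or {{Uncategorized-Geograph|...}}.
# Assumption: the parameter list contains no nested braces.
_TEMPLATE = re.compile(
    r"\{\{\s*(Check categories-Geograph|Uncategorized-Geograph)\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def parse_geograph_template(wikitext):
    """Return (kind, params) for the first Geograph template found, else None."""
    match = _TEMPLATE.search(wikitext)
    if match is None:
        return None
    name = match.group(1).lower()
    kind = "early" if name.startswith("check") else "later"
    params = {}
    for part in match.group(2).split("|"):
        key, sep, value = part.partition("=")
        if sep:  # skip positional (unnamed) parameters
            params[key.strip()] = value.strip()
    return kind, params


early = parse_geograph_template(
    "{{Check categories-Geograph|year=2010|month=May|day=30"
    "|lat=53.422961|lon=-1.908169|Geographcategory=Moorland}}"
)
later = parse_geograph_template(
    "{{Uncategorized-Geograph|gridref=SX6960|year=2011|month=February|day=21}}"
)
```

For an early image the `Geographcategory` parameter carries the subject recorded by Geograph, while a later image only carries the `gridref`.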

Using Cat-a-lot[edit]

Cat-a-lot has a few quirks, but does its job well.

  • All Geograph files contain the text gridref in the description. So, to find all Geograph images of the River Brun, type River Brun gridref in the standard search box. This will find all the relevant files and display a thumbnail and a description. The second line of the description will read gridref SD8533 | year 2011 | month February | day 26 or similar. This allows you to do a visual check that the file is in one of the correct grid squares. A river name may belong to several rivers in different counties: check out the Rivers Derwent, of which Derbyshire, Yorkshire, Cumbria, Durham and Northumberland each have one.
  • With the files selected, open Cat-a-lot and type in your target category. If it is not there, it will create it for you. Select all your files, and add them to that category. If the category is new, click on an image contained within it, click on the empty category and use Hot-Cat to add a parent category.
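The visual grid-square check described above could also be sketched in code. This is only an illustration: the function names are hypothetical, the pattern assumes the AA9999 / A9999 grid square format, and the set of "expected" squares for a river is something you would have to compile yourself (only SD8533 below comes from the example above, the other square is made up).

```python
import re

# One or two letters followed by four digits, after the word "gridref".
_GRIDREF = re.compile(r"gridref\s*=?\s*([A-Z]{1,2}\d{4})\b")


def gridref_from_description(description):
    """Return the grid square named in a search-result description, or None."""
    match = _GRIDREF.search(description)
    return match.group(1) if match else None


def in_expected_squares(description, expected):
    """True if the description names a grid square listed in `expected`."""
    ref = gridref_from_description(description)
    return ref is not None and ref in expected


# Hypothetical list of squares the river runs through.
river_brun_squares = {"SD8533", "SD8432"}

line = "gridref SD8533 | year 2011 | month February | day 26"
print(gridref_from_description(line), in_expected_squares(line, river_brun_squares))
```

A file whose square is not in the expected set is a candidate for a same-named river in another county, so it should be checked by hand before Cat-a-lot is applied.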

Adding categories- going on a wikiwalk[edit]

During the last week I have been spending many pleasant and nostalgic hours walking in the Peak District-- all from the comfort of my computer. I am now at the stage of taking off my boots, having a pint and discussing the many things I have seen.

Geograph photos are brilliant... but back to categorisation and a few rambling thoughts!

Now that's a lot of stuff. To get through the various component questions:
  1. The problem with cities is not related to religion and would probably not align with the diocese boundaries anyway, so that cat-a-lot idea isn't workable IMO. It is a result of the coding for cities being "greedy" and including a lot of the surrounding countryside. Some cities are much worse than others. Plymouth was exceptionally bad before I fixed it: all these were categorised to Plymouth, but most are not of the city.
  2. On the moorland / CP issue: Personally I like everything to be placed in the correct parish (or better), so all hamlet cats are included in their parish categories. This makes the CP a useful way to search and exploit via category intersects. This does mean that in some cases (eg Category:Peter Tavy), countryside stuff overwhelms the few snaps of the village proper. This could be handled by splitting the parish from the village. If this is done, Category:Peter Tavy (parish) is better than Category:Peter Tavy CP ("Peter Tavy CP" doesn't exist but the CP called "Peter Tavy" does, and then needs disambiguating from the village), and this is not an artificial construct as it represents a real area. The unparished area of Glossop is an equally well defined area - just don't call it Glossop (parish) because it isn't :)
  3. As for the subject categorisation: Do what seems most appropriate. Create categories if there's an article on WP on the subject, which would benefit from it. Don't bother creating intersection categories unless/until the parents are too bloated, in which case do so. Don't lose sleep over it in any case, otherwise you'll go round in circles for months. In the case of Derbyshire, "Streams in the Peak District" is a natural sub-cat. Think about it in urban terms: At what point would you create Category:Darnley Road, Rochester?
  4. The recent transfers do include the grid square, which is better than nothing. The grid square readily gives the civil parish (or whatever), so potentially provides greater accuracy for the localisation - You could use Cat-a-Lot to move everything in Images from the Geograph British Isles project needing categories in grid SE2933 to Category:Leeds, which is where GeographBot would probably have put them. They would still need subject categories, so shouldn't have the "give me categories" tag removed. The bots were always pretty good with subject categorising, they (nearly) always get the <subject> in <county> category right. Extracting the subject tag from Geograph would help there, and could always be done by bot either on upload or later. Remember, the subject categorisation always benefits from a manual look, as Geograph only records one subject matter, and images may relate to multiple subjects - no bot can add the info when it just isn't there.

Think that covers it :)--Nilfanion (talk) 19:07, 12 February 2011 (UTC)

I think there is enough there to formulate an advice note.
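As a starting point for such an advice note, the grid-square idea from point 4 can be sketched as follows. The maintenance category wording and the SE2933 to Category:Leeds pairing are taken from the comment above; the function names and the lookup table are hypothetical, and the table would have to be filled in by hand or from parish data.

```python
PREFIX = "Images from the Geograph British Isles project needing categories in grid "


def maintenance_category(gridref):
    """Build the per-square maintenance category title for a grid square."""
    return "Category:" + PREFIX + gridref.strip().upper()


def gridref_from_category(title):
    """Recover the grid square from a maintenance category title, or None."""
    if title.startswith("Category:"):
        title = title[len("Category:"):]
    if not title.startswith(PREFIX):
        return None
    return title[len(PREFIX):].strip()


# Hypothetical hand-maintained lookup; only this entry comes from the discussion.
LIKELY_LOCATION = {"SE2933": "Category:Leeds"}


def suggested_location(title):
    """Suggest a location category for a maintenance category, or None."""
    return LIKELY_LOCATION.get(gridref_from_category(title))


cat = maintenance_category("se2933")
print(cat, "->", suggested_location(cat))
```

The suggestion only covers the location; the files would still need subject categories, so the maintenance tag should stay until a human has looked at them.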

Durham[edit]

Please see Commons:Help_desk#Category:_Durham_England - the bot (Geograph bot) is placing images in "Durham" when it should be "County Durham", eg [2]. Please fix. Thank you. Imgaril (talk) 18:49, 2 August 2011 (UTC)

Also the bot has been placing files (from North Yorkshire) in Category:Ingleby which should be in Category:Ingleby Greenhow Imgaril (talk) 18:23, 3 August 2011 (UTC)

Problem with geographic categories[edit]

There's a problem with the geographic categories geographbot is uploading to - it keeps adding images to 'nearest habitation' categories whether or not this is useful, eg http://commons.wikimedia.org/w/index.php?title=File:Enclosure,_Great_Ayton_Moor_-_geograph.org.uk_-_11403.jpg&action=history - adds to Category:Kildale (the correct geo-feature would be "Great Ayton Moor", and the location is within the civil parish of "Great Ayton", not Kildale). From experience the bot does this a lot - possibly more misses than hits.

Fixing these errors isn't the main problem; the problem is that if a geographical category is made "clean", ie correct members, then the bot can come along and randomly spam incorrect images into the category. As an example see Category:Little Ayton - all UK village categories seem to be full of misplaced images as a result. Here [3] it's added to the category "roads" - there is no road?

Bot User talk:BotMultichillT also does similar odd or wrong things - eg in the last example above it added the file to a parent category, ie Category:Hambleton is nearly junk now (the file was already in a sub cat).

Does this still occur? If there has been no fix then the bots really need to use maintenance categories, not the visible final categories, as they are essentially trashing them.

Here's another example of a category that gets trashed by the bots - and the files added don't show up in Kingston upon Hull's main category - so they don't get seen to be fixed. eg [4] - other categories appear ok, but the "by place" categorisation is badly broken.

Please can you make a fix or stop the activity. Imgaril (talk) 19:39, 5 August 2011 (UTC)

This is constantly happening, I dropped a note on the user page but got no response. I have had to remove categories for the East Riding of Yorkshire from several hundred North Yorkshire images, as far north as Whitby. In many cases the correct category is already on the image and the addition is just causing work. This Hull image I have had to revert twice as the BOT has added two incorrect location categories and also the parent category to the one already on the image. Keith D (talk) 20:01, 5 August 2011 (UTC)
With respect to the image of Gower Park, if it's correctly categorized why retain the {{uncategorized-Geograph}} maintenance template at all? I'd personally just remove it once you've got the right basic categories: most precise location that is sensible + relevant subject in city/county cats. Once those are applied, the image is not a priority compared to the 1M+ uncategorized images, so removing the template means the bots can move on to others. The "nearest-location" thing has a very high error rate.--Nilfanion (talk) 21:06, 5 August 2011 (UTC)
'hotcat' doesn't always remove the uncategorised tag, and with 100s+ of images to do it's not humanly realistic to manually remove the tag on each one.
I can see a possible solution that would seem more workable: place the subcategories of Category:Images from the Geograph British Isles project needing categories by grid square in the relevant geographic categories, not the individual images.
An example - if Category:Images from the Geograph British Isles project needing categories in grid G6742 had been placed in Category:Drumcliffe that would A. Make the images accessible temporarily. B. Reduce the amount of work needed. C. Prevent the mass foul-ups that are clearly occurring. D. Make it easier for humans to manually do the categories. (in my opinion)
My experience sounds almost exactly the same as Nilfanion's in a section above http://commons.wikimedia.org/wiki/Commons:Batch_uploading/Geograph#Using_Cat-a-lot
I'd support Nilfanion's civil parish idea (above) especially for unpopulated regions where it makes most sense
As for the other categorisation that the bots do it seems more useful and less error prone. Though like KeithD I've seen it put some far off images in East Yorkshire - I assume this must be a bug - since geograph usually gets the County right on its own pages.
Would other people support the idea of using the bot to place categories, not images, from Category:Images from the Geograph British Isles project needing categories by grid square into geographic place categories? Imgaril (talk) 15:12, 6 August 2011 (UTC)
I've requested the bots be stopped until the error rate is addressed, see Commons:Administrators'_noticeboard#User:BotMultichillT.
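For anyone wanting to try the "categorise the grid-square category, not the images" proposal, here is a minimal sketch of the text change involved. It only transforms a string of wikitext and does not talk to Commons; the function names are hypothetical, and the G6742 / Drumcliffe pairing is the example given in the proposal above.

```python
import re


def has_category(page_text, parent):
    """True if page_text already links to Category:parent (sort key allowed)."""
    pattern = (
        r"\[\[\s*Category\s*:\s*" + re.escape(parent) + r"\s*(\|[^\]]*)?\]\]"
    )
    return re.search(pattern, page_text, re.IGNORECASE) is not None


def add_parent_category(page_text, parent):
    """Return page_text with [[Category:parent]] appended if it is missing."""
    if has_category(page_text, parent):
        return page_text  # nothing to do, so repeated runs are harmless
    link = "[[Category:" + parent + "]]"
    body = page_text.rstrip("\n")
    return (body + "\n" if body else "") + link + "\n"


# Text of the grid-square maintenance category page (made-up placeholder).
grid_page = "Files in grid G6742 awaiting categories."
updated = add_parent_category(grid_page, "Drumcliffe")
print(updated)
```

Because the change is made once per grid square rather than once per image, a wrong guess is a single edit to revert and never touches the categories already on the files.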

Uploads with Magnus' bot: formatting problem (sample)[edit]

It might be worth re-importing the descriptions with GeographBot, see Commons:Bots/Work_requests#photograph_every_grid_square for details. --  Docu  at 08:10, 7 July 2012 (UTC)