Commons:Bots/Requests/EatchaBot 2

From Wikimedia Commons, the free media repository

EatchaBot (talk · contribs) 2

Operator: Eatcha (talk · contributions · Number of edits · recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: Adding language templates to file descriptions by identifying the language.

Automatic or manually assisted: Automatic

Source code in Action: https://repl.it/repls/SoupyPortlyCurrencies (MIT License)

Edit type (e.g. Continuous, daily, one time run): Continuous

Maximum edit rate (e.g. edits per minute): At most 10 per minute, but unlikely to reach.

Bot flag requested: (Y/N): Flagged

Programming language(s): Python

Eatcha (talk) 13:50, 11 January 2020 (UTC)

Discussion

  • Will do 200 test edits: 100 hard* ones + 100 easier* ones.
  • By hard here I meant : more crap in the description. Easier = less crap in the description.

-- Eatcha (talk) 14:19, 11 January 2020 (UTC)

Can you share a link to your source code? Can you apply this fix? Multichill (talk) 15:15, 11 January 2020 (UTC)

Could you please elaborate how you detect the language? I did check the German results and there are too many incorrect ones:

  • [1] This is not German and also not a suitable description at all.
a bit of German: de:Hinterriß Achim (talk) 21:29, 11 January 2020 (UTC)
  • [2] This is not German.
  • [3] This is only a German town name, but the text is not German
  • [4] This is not even a text in any language.
  • [5] Definitely English and not German.
  • [6] Not German.
it's Polish Achim (talk) 21:29, 11 January 2020 (UTC)

I have a feeling most of the other languages are incorrect as well. Any more non-english natives that can check? --Schlurcher (talk) 21:10, 11 January 2020 (UTC)

This should be Russian [7] --Schlurcher (talk) 21:18, 11 January 2020 (UTC)
(edit conflict) Uh, I wouldn't dare to code such a difficult task. - There are several misidentified to be Kiswahili that are transliterated Japanese like this one for example. Good luck! --Achim (talk) 21:19, 11 January 2020 (UTC)
this isn't Tagalog but transliterated Ukrainian. --Achim (talk) 21:47, 11 January 2020 (UTC)
this isn't Korean but Chinese
this isn't Lithuanian but Chinese
I think we can stop further checking... --Achim (talk) 22:04, 11 January 2020 (UTC)
I was using a data set that turned out to be crap; I will now try using better open data sets. This was just test run 1, and I will try to improve. There are several options, such as treating numbers/symbols as stop words and ignoring short texts. Thanks for catching the mistakes; I was using G-translate to check for errors. -- Eatcha (talk) 04:10, 12 January 2020 (UTC)
I made some vital changes, will test some files. Let's see what happens. -- Eatcha (talk) 10:50, 12 January 2020 (UTC)
Looking at https://pypi.org/project/langdetect/ I would check out the scores (detect_langs) it gives and only use the result if the score is above a certain threshold. Adding the score to the edit summary helps with debugging. Also, most users have babel boxes these days. You can check whether the language you found makes sense by comparing it to the babel box of the uploader. Of course that only works with local uploads. Multichill (talk) 11:06, 12 January 2020 (UTC)
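The threshold and babel-box checks suggested above could be sketched roughly as follows. This is only an illustration: `Candidate` is a stand-in for the objects returned by langdetect's `detect_langs` (which expose `.lang` and `.prob`), and `MIN_SCORE`, `pick_language`, and `babel_langs` are hypothetical names, not part of the bot's code.

```python
from collections import namedtuple

# Stand-in for langdetect's result objects (they expose .lang and .prob),
# so the sketch runs without the library installed.
Candidate = namedtuple("Candidate", ["lang", "prob"])

MIN_SCORE = 0.95  # hypothetical threshold; would need tuning on reviewed edits


def pick_language(candidates, babel_langs=None, min_score=MIN_SCORE):
    """Return (lang, score) for the best candidate, or None to skip the file."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    if best.prob < min_score:
        return None  # not confident enough to edit
    # Optional sanity check against the uploader's babel-box languages.
    if babel_langs is not None and best.lang not in babel_langs:
        return None
    return best.lang, best.prob
```

Returning None rather than a low-confidence guess means the bot skips the file, which trades coverage for fewer wrong tags.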
[8] you probably want to use re.search and use the same regex to find it and to replace it to prevent these kind of mistakes. So something like:
regex = "({{[Ii]nformation\n\|(?:\s*|)[Dd]escription(?:\s*|)\=)(?:\s*|)((.|\n)*?)(?:^\|)"
Do re.search, if match, get group 2, mangle and test it, if all is ok, do a re.sub on the regex with "\\1<your new description>". Multichill (talk) 12:19, 12 January 2020 (UTC)
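A minimal sketch of that search-then-substitute approach, using one compiled pattern for both steps. It is a variation on the regex above, not the bot's actual code: the pattern is written as a raw string, uses a lookahead for the next `|` so that the following parameter is not consumed, and `tag_description` is a hypothetical helper name.

```python
import re

# One pattern for both finding and replacing the description value.
# Group 1: everything up to and including "Description =".
# Group 2: the description text, up to the next line starting with "|".
DESCRIPTION_RE = re.compile(
    r"(\{\{[Ii]nformation\s*\n\|\s*[Dd]escription\s*=)[ \t]*(.*?)(?=^\|)",
    re.DOTALL | re.MULTILINE,
)


def tag_description(wikitext, lang):
    """Wrap the description in a language template, or return None to skip."""
    match = DESCRIPTION_RE.search(wikitext)
    if match is None:
        return None
    description = match.group(2).strip()
    # Skip empty descriptions and ones that already contain templates.
    if not description or "{{" in description:
        return None
    tagged = "{{%s|1=%s}}" % (lang, description)
    # A function replacement avoids backslash/group-reference surprises
    # if the description itself contains backslashes.
    return DESCRIPTION_RE.sub(
        lambda m: m.group(1) + tagged + "\n", wikitext, count=1
    )
```

Note that this sketch skips pages where the description is the last parameter (no following line starting with `|`); a real implementation would need to decide how to handle that case.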
  • Swahili language is now blocked due to many errors (transliterated Japanese addresses). It's the only problem I can detect right now. Native speakers of sw : Almost 150 million (2012), not a big deal. -- Eatcha (talk) 15:01, 12 January 2020 (UTC)
Hi. Thanks for looking into the first round of comments. Unfortunately, I have to say that there are still a lot of problems with these language assignments. I have checked the last 20 edits and found 6 errors:
  • [9], bot seems to fail if there are Turkish town names with an English description of the country. The total string seems to be English.
  • [10] The bot seems to fail if there is more than one language. The confidence given by the bot seems meaningless if this gets 0.99 for English.
  • [11] Slovak name with English country seems to fool the bot as well.
  • [12] Even long text does not seem to help, as the English text is not identified.
  • [13] [14] Spanish town with English country also gets wrong language.
I like this task a lot and I also had it on the agenda for my bot, but have not yet had time to look into it or start any coding. I'm happy to share some of my initial thoughts in case they are helpful.
  • The task should be restricted to descriptions that do not include new lines (\n or <p>). Often people tend to place different languages into different lines.
  • I think reliable language detection needs more characters to work properly. I would suggest the SMS text length of 140 as a minimum.
  • Any text in brackets, with wiki formating or in "" should be ignored in the above rules, as these tend to be descriptions in other languages.
Did you also look into the Google or Microsoft language detection APIs? The Python one seems to give unrealistic percentages, and the code base is from 2010 with the last update in 2014. Most of the development of natural language processing happened after 2014. --Schlurcher (talk) 08:43, 13 January 2020 (UTC)
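The three restrictions suggested above (single-line descriptions only, a minimum length of 140 characters, and ignoring bracketed, quoted, or wiki-formatted fragments when applying those rules) could be sketched like this. The names `is_eligible`, `MIN_CHARS`, and `IGNORED_FRAGMENTS` are illustrative, and the fragment pattern is a simplification that does not handle nesting.

```python
import re

MIN_CHARS = 140  # minimum length suggested in the comment above

# Fragments that often hold text in another language and are therefore
# left out when measuring the description.
IGNORED_FRAGMENTS = re.compile(
    r"\([^()]*\)"        # (parenthesised text)
    r"|\[[^\[\]]*\]"     # [bracketed text]
    r'|"[^"]*"'          # "quoted text"
    r"|'{2,}.*?'{2,}"    # ''wiki italic'' or '''wiki bold'''
)


def is_eligible(description):
    """Return True if the description is a long enough single-line text."""
    text = description.strip()
    # Multi-line descriptions often hold one language per line.
    if "\n" in text or re.search(r"<p\b", text, re.IGNORECASE):
        return False
    remaining = IGNORED_FRAGMENTS.sub("", text)
    return len(remaining.strip()) >= MIN_CHARS
```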
I am cross checking with Google cloud translation.
  • 9 is up-to standards as per google cloud. ✔️
  • 10 is not up-to standards per google cloud. ❌ I do not have access to Microsoft's API, but their website https://www.bing.com/translator (which depends on their API) classifies it as English. The language with more words gets preference. According to google it's Latvian, which is also not true. These cases are always hard to handle.
  • 11 is up-to standards as per google cloud. ✔️
  • 12 Not up-to standards. But super easy to fix. ❌
  • 13 is up-to standards as per google cloud. ✔️
  • 14 is up-to standards as per google cloud. ✔️
By up-to standard I mean : You would get the same result using google-cloud-translation API.
I am ignoring texts that contain "{{", "}}" and "|", and removing crap after calculating the difference in scores once normalization against the google-cloud-translation API results has been applied; it's improving with edits. I will also (within 8 hours) add Unicode-based detection for more reliability. -- Eatcha (talk) 09:23, 13 January 2020 (UTC)
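The markup filter and a Unicode-based check along the lines mentioned above might look something like the following. This is a guess at the general idea, not the bot's implementation: it classifies each character by the script prefix of its Unicode name, and `SCRIPTS`, `script_of`, `dominant_script`, and `has_template_markup` are hypothetical names.

```python
import unicodedata
from collections import Counter

# Script prefixes as they appear at the start of Unicode character names.
SCRIPTS = (
    "LATIN", "CYRILLIC", "ARABIC", "HANGUL", "HIRAGANA",
    "KATAKANA", "CJK", "GREEK", "HEBREW", "THAI",
)


def has_template_markup(text):
    """True if the text contains template syntax and should be skipped."""
    return any(token in text for token in ("{{", "}}", "|"))


def script_of(char):
    """Return the script prefix of a character's Unicode name, or None."""
    name = unicodedata.name(char, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return None  # digits, punctuation, spaces, unknown scripts


def dominant_script(text):
    """Return the most frequent script among the text's characters."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

A script check like this can only veto impossible results (for example, Korean for a text with no Hangul); it cannot separate languages that share a script.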
Where it beats Google: here , here here
-- Eatcha (talk)09:37, 13 January 2020 (UTC)
Please elaborate why [15] is a good edit in support of a case where your language detection beats Google --Schlurcher (talk) 10:32, 14 January 2020 (UTC)
FYI :Following is google's response for the text "Bung Ta Lua Water Park". It's Romanian according to them, my bot classified it as Bahasa Indonesia Which is better than google considering where the park is located. I am not saying that I will not fix it or it's correct but just pointing out accuracy compared to the tech giant's API.
{
  "data": {
    "detections": [
      [
        {
          "confidence": 0.9281739592552185,
          "isReliable": false,
          "language": "ro"
        }
      ]
    ]
  }
}
-- Eatcha (talk) 10:44, 14 January 2020 (UTC)
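Note that the response above marks its own guess as unreliable ("isReliable": false). A small sketch of how such a response could be filtered before it is used for cross-checking; the `reliable_language` helper and its `min_confidence` default are assumptions for illustration, and the JSON is the response quoted above.

```python
import json

# The detection response quoted above for "Bung Ta Lua Water Park".
RESPONSE = """
{
  "data": {
    "detections": [
      [
        {
          "confidence": 0.9281739592552185,
          "isReliable": false,
          "language": "ro"
        }
      ]
    ]
  }
}
"""


def reliable_language(response_text, min_confidence=0.9):
    """Return the best language flagged as reliable, or None if there is none."""
    payload = json.loads(response_text)
    best = None
    for group in payload["data"]["detections"]:
        for detection in group:
            if not detection.get("isReliable", False):
                continue  # the service itself does not trust this guess
            if detection["confidence"] < min_confidence:
                continue
            if best is None or detection["confidence"] > best["confidence"]:
                best = detection
    return None if best is None else best["language"]
```

For the response above this returns None, so the "ro" guess would simply be discarded instead of being compared with the bot's result.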
  • About names of cities/humans: It's fine if it detects Schlurcher as German, because it is a German name thus automatically a German word. I do not expect it to be English. -- Eatcha (talk) 09:52, 13 January 2020 (UTC)
  • The bot now links to this page with the summary "Report errors here"; I will remove that after approval. -- Eatcha (talk) 06:34, 14 January 2020 (UTC)
    @Eatcha:. Please stop your bot immediately. The bot has now performed almost 2'000 edits. Please give others time to review this task. Could you please elaborate which further changes were implemented in this new bot run to address the concerns that I have raised above? The bot still seems to mark a lot of files inappropriately. Please also comment on this edit [16]. --Schlurcher (talk) 10:23, 14 January 2020 (UTC)
  • There seems to be a bug with this edit: [17] --Schlurcher (talk) 10:41, 14 January 2020 (UTC)
Schlurcher Check out https://repl.it/repls/DistantWiseJavabeans . Any Idea what's wrong with the code ? Works fine with all other files, but that file ... -- Eatcha (talk) 13:46, 14 January 2020 (UTC)
I did not look at the code yet. My best guess at the moment is that it is due to the [ in the description. --Schlurcher (talk) 17:15, 14 January 2020 (UTC)
You are right, about the square brackets. -- Eatcha (talk) 17:28, 14 January 2020 (UTC)
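The bot's code is not shown on this page, so the exact cause is unknown, but one common way a "[" in a description breaks a replacement is when the description is passed to `re.sub` as a pattern without escaping. A sketch of the safer form, with `replace_literal` as a hypothetical helper:

```python
import re


def replace_literal(wikitext, old_description, new_description):
    """Replace old_description verbatim, treating it as text, not a pattern."""
    # re.escape neutralises characters such as "[", "(", "." and "*".
    pattern = re.escape(old_description)
    # A function replacement keeps backslashes in new_description literal too.
    return re.sub(pattern, lambda _m: new_description, wikitext, count=1)
```

Without escaping, an unbalanced "[" raises `re.error`, and a balanced pair such as "[1]" is read as a character class, so the pattern silently fails to match the literal text. For a plain literal replacement, `str.replace` would also do.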
  • Schlurcher According to you this should have been English? I can make sure that the bot adds English; I just think that Reinstädt is a German word because it's a German municipality. I can also skip it. -- Eatcha (talk) 10:53, 14 January 2020 (UTC)
I think we have a different understanding on these language tags. I will summarize my thoughts once I have some time. We should also wait for further people to look at these edits. --Schlurcher (talk) 11:03, 14 January 2020 (UTC)
IMHO if any text predicted to be de has Deutschland in the text itself, then it should be de. If it contains Germany, most probably it's just a place/name in German and the string as a whole is in English. -- Eatcha (talk) 11:30, 14 January 2020 (UTC)
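The rule described above could be sketched as a small post-processing step. The country-name pairs below are only the ones that come up in examples on this page, and `COUNTRY_NAMES` and `adjust_prediction` are illustrative names rather than anything from the bot.

```python
import re

# (English name, native name) per predicted language; pairs taken from
# examples raised in this discussion, not a complete list.
COUNTRY_NAMES = {
    "de": ("Germany", "Deutschland"),
    "sk": ("Slovakia", "Slovensko"),
    "lt": ("Lithuania", "Lietuva"),
}


def adjust_prediction(text, predicted):
    """Switch to 'en' when the country is named in English, not natively."""
    names = COUNTRY_NAMES.get(predicted)
    if names is None:
        return predicted
    english, native = names
    words = set(re.findall(r"\w+", text))
    if native in words:
        return predicted  # native country name supports the prediction
    if english in words:
        return "en"  # e.g. "Osnabrück, Germany" is an English string
    return predicted
```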
I have one suggestion with regard to zh languages. @Eatcha: can you please add only zh-hans for simplified chinese and zh-hant for traditional chinese? That is, zh-cn/sg/my should be zh-hans instead. zh-tw/hk/mo should be zh-hant. Reason is it generally makes more sense to distinguish between simp. and trad. chinese rather than regional variations.--Roy17 (talk) 23:36, 14 January 2020 (UTC)
zh-cn/sg/my should be zh-hans (S at end = simplified )
zh-tw/hk/mo should be zh-hant (T at end = traditional )
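The mapping requested above is small enough to express as a lookup table; `ZH_VARIANTS` and `normalize_zh` are illustrative names.

```python
# Regional Chinese codes collapsed to script-based codes, as requested above.
ZH_VARIANTS = {
    "zh-cn": "zh-hans",
    "zh-sg": "zh-hans",
    "zh-my": "zh-hans",
    "zh-tw": "zh-hant",
    "zh-hk": "zh-hant",
    "zh-mo": "zh-hant",
}


def normalize_zh(code):
    """Map a regional zh code to zh-hans/zh-hant; leave other codes as-is."""
    return ZH_VARIANTS.get(code.lower(), code)
```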
  • The bot could usefully quote some or all of the description it's tagging in the edit summary, since that would make it easier to spot its mistakes without having to examine every edit individually. For Panoramio uploads, the filename is often the same as the description, but that won't be the case for a lot of other files. --bjh21 (talk) 12:34, 15 January 2020 (UTC)
Seeing the descriptions in the summary is very helpful. Could you please decrease the prediction precision to like 3 digits. The remaining just take space. --Schlurcher (talk) 08:32, 16 January 2020 (UTC)
I agree. prediction precision up to 3 digits after decimal point, like 0.987, is good enough.--Roy17 (talk) 23:09, 17 January 2020 (UTC)
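A sketch of an edit summary that quotes the description and rounds the score to three decimals, as suggested above. The wording of the summary, the `build_summary` name, and the `max_chars` limit are assumptions.

```python
def build_summary(lang, score, description, max_chars=80):
    """Build an edit summary with the language, a 3-digit score and a quote."""
    # Collapse whitespace so the quote stays on one line.
    snippet = " ".join(description.split())
    if len(snippet) > max_chars:
        snippet = snippet[: max_chars - 1].rstrip() + "…"
    return f'Tagged as {lang} ({score:.3f}): "{snippet}"'
```

For example, a score of 0.9281739592552185 is shown as 0.928.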
Support: great bot, which solves a persistent problem. Many thanks to Eatcha for putting this together.--Roy17 (talk) 23:09, 17 January 2020 (UTC)

Errors below this line.

  • Error: in File:013 24 Strečno, Slovakia - panoramio (15).jpg, "013 24 Strečno, Slovakia" was tagged as Slovenian (sl). I think it's English (en) because "Slovakia" is the English name of the country. In Slovak (sk) it would be something like "Slovensko", while in Slovenian it would be "Slovaška". --bjh21 (talk) 13:26, 14 January 2020 (UTC)
✔️ Fixed, today. Any text detected as Slovak that contains "Slovakia" should be English. -- Eatcha (talk) 05:19, 15 January 2020 (UTC)
✔️ Fixed -- Eatcha (talk) 05:20, 15 January 2020 (UTC)
✔️ Fixed this, today. But it's TOTALLY wrong to call these errors (Try Google to check my statement). For example a text "at Екатеринбург" IS clearly Russian not English just because we have at the beginning. -- Eatcha (talk) 05:24, 15 January 2020 (UTC)
"At Екатеринбург" is a different case. There, there are distinct Russian ("Екатеринбург") and English ("Yekaterinburg") names for the city. "At Екатеринбург" is in a mixture of Russian and English and tagging it as either would be incorrect. I was talking about cases where English uses the native version of a name. For instance, "Osnabrück, Germany" should be tagged as English, not German. --bjh21 (talk) 12:22, 15 January 2020 (UTC)
  • ❌ I can not do anything for typos , I can just add it to black list. -- Eatcha (talk) 04:58, 15 January 2020 (UTC)
  • Error: in File:- panoramio (3626).jpg, "Akgümüş Plain from the Dumanlı Mts, Bunduk, Kahramanmaraş, Turkey. Abies cilicica in foreground." was tagged as Turkish (tr). However it's in English (en) with a lot of Turkish (and one Latin) names in it. --bjh21 (talk) 13:51, 14 January 2020 (UTC)
  • ✔️ Fixed this, today. -- Eatcha (talk) 05:32, 15 January 2020 (UTC)
  • Error: in File:"Wasserburg Lüftelburg" by Niederkasseler - panoramio.jpg, the only part of '"Wasserburg Lüftelburg" by Niederkasseler' that's not a proper noun is "by", which is English (en). However the bot tagged it as German (de). --bjh21 (talk) 13:51, 14 January 2020 (UTC)
  • ❌ Check my statement about Yekaterinburg. West Germanic languages use what we call english alphabets but what matters is what text is present in significant amount. There are no characters such as umlaut and Eszett in English . -- Eatcha (talk) 05:31, 15 January 2020 (UTC)
  • File:02.11.2008. Peggau A - panoramio.jpg: An Austrian place-name, "Peggau", got tagged as Indonesian (id). Maybe it's German (de)? --bjh21 (talk) 17:12, 14 January 2020 (UTC)
Added to block list. -- Eatcha (talk) 05:16, 15 January 2020 (UTC)
File:- panoramio (7886).jpg not german. more like english.
❌ is nothing in English. most probably Nauchniy is typo of Nauchny --- Eatcha (talk) 05:14, 15 January 2020 (UTC)
File:001103高岡古城公園 - panoramio.jpg could be japanese and chinese (all are kanji anyway, but since the place is in japan and judging the author's profile technically it's japanese) not korean.
✔️ Fixed , today -- Eatcha (talk) 05:13, 15 January 2020 (UTC)
File:034 皇居桜田門 - panoramio.jpg japanese not chinese.
✔️ Fixed , today. -- -- Eatcha (talk) 05:09, 15 January 2020 (UTC)
File:2 Chome Kotobuki, Abiko-shi, Chiba-ken 270-1152, Japan - panoramio (5).jpg transliterated japanese not indonesian.
✔️ Fixed , on 13th . -- Eatcha (talk) 05:08, 15 January 2020 (UTC)
File:2010-10-17 吉野発電所水路 - panoramio.jpg japanese not korean.
✔️ Fixed , today. -- Eatcha (talk) 05:08, 15 January 2020 (UTC)
File:2010-6-24 吉野三郎 - panoramio (1).jpg japanese (could also be chinese) not korean.
✔️ Fixed , on 13th . Getting zh -- Eatcha (talk) 05:04, 15 January 2020 (UTC)
File:2014-04-04 石象湖 郁金香 liuzusai - panoramio (39).jpg chinese.
✔️ Fixed , on 13th . -- Eatcha (talk) 04:59, 15 January 2020 (UTC)
File:2014鹿港慶端陽 文開國小管絃樂表演 - panoramio.jpg zh-hant not korean.
✔️ Fixed , on 13th . -- -- Eatcha (talk) 05:00, 15 January 2020 (UTC)
File:226, Taiwan, 新北市平溪區十分里 - panoramio (13).jpg zh-hant.--Roy17 (talk) 23:15, 14 January 2020 (UTC)
✔️ Fixed , on 13th . -- Eatcha (talk) 04:53, 15 January 2020 (UTC)

New Errors Here (Added errors that occurred after 13th January ONLY). Bot has improved since the first run.

I did review the 16th January ones. The bot definitively improved :-). --Schlurcher (talk) 09:37, 16 January 2020 (UTC)

  • [18], bot did categorize as fr despite higher prediction for en and multiple English words identified. --Schlurcher (talk) 08:29, 16 January 2020 (UTC) ✔️ Fixed
  • [19], seems to be en --Schlurcher (talk) 09:37, 16 January 2020 (UTC)
  • [20]. Seems right. But I am wondering, do you have a lower limit for the prediction level, or will always the most likely language be assigned? --Schlurcher (talk) 09:37, 16 January 2020 (UTC) Answer : 3 Mechanisms are working here simultaneously to yield the best detection result. For Japanese and korean 4 mechanisms and 5 for Chinese. The detection rate is just one of the 3 things working here. ✔️ This is an English sentence about an Islamic Indian monument. -- Eatcha (talk) 18:57, 16 January 2020 (UTC)
  • [21], seems to be en --Schlurcher (talk) 09:37, 16 January 2020 (UTC) ✔️ Fixed
  • [22], seems to be en --Schlurcher (talk) 09:37, 16 January 2020 (UTC)
  • [23], seems to be a mix. An example, why I would still suggest to remove files where the description is more than one line long. --Schlurcher (talk) 09:37, 16 January 2020 (UTC)

A few more errors:

Thank you for putting the description in the edit summary: it made checking much easier. --bjh21 (talk) 12:21, 16 January 2020 (UTC)

File:近鉄大阪線 河内国分駅 Kawachi-Kokubu station 2012.10.09 - panoramio (1).jpg ja not zh. clue: 鉄 駅 are japanese kanji, which are rarely used in zh. ✔️ Fixed
File:名柄郵便局(御所市) Nagara Post Office 2012.4.07 - panoramio.jpg ja not zh. clue: 郵便局 is a japanese only word. ✔️ Fixed
File:坪川歯科医院 - panoramio.jpg ja not zh. 歯 is japanese kanji. zh-hant is 齒. zh-hans is 齿. ✔️ Fixed
File:岡山駅前の路面電車 by takeokahp - panoramio.jpg ja not en. ✔️ Fixed
File:松木島駅跡 - panoramio.jpg ja not zh. ✔️ Fixed
File:県道63号線 - panoramio (1).jpg ja not zh. 県 ja kanji. zh-hant=縣. zh-hans=县. ✔️ Fixed
File:阿寺城 三重堀切 - panoramio (2).jpg ja not zh. 堀切 ja only word. ✔️ Fixed
--Roy17 (talk) 23:09, 17 January 2020 (UTC)
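The clues given in the list above (kana, Japanese-only character forms such as 鉄 駅 歯 県, and Japanese-only words such as 郵便局 and 堀切) could be turned into a simple check like the one below. The character and word lists contain only the examples from this page, so this is a sketch of the idea rather than a usable classifier.

```python
# Character forms and words named above as Japanese rather than Chinese.
JAPANESE_ONLY_KANJI = set("鉄駅歯県")
JAPANESE_ONLY_WORDS = ("郵便局", "堀切")


def is_kana(char):
    """True for hiragana (U+3041-U+3096) and katakana (U+30A1-U+30FA) letters."""
    return "\u3041" <= char <= "\u3096" or "\u30a1" <= char <= "\u30fa"


def looks_japanese(text):
    """True if the text carries any of the Japanese-specific clues."""
    if any(is_kana(ch) or ch in JAPANESE_ONLY_KANJI for ch in text):
        return True
    return any(word in text for word in JAPANESE_ONLY_WORDS)
```

A check like this can only move a CJK text from zh (or ko) to ja; text without any of these clues stays ambiguous and still needs the other detection steps.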

Is this desired behavior? It's right about the words being English, but it's just a scraped title from Flickr (complete with ending with "06" because it was part of a sequence I shot and uploaded there). Do we really want to call that out as an English-language description? It was a title, but not really a description. If that's desired behavior, fine. - Jmabel ! talk

Yes, it is. After classifying file descriptions, it should be a bit easier to automatically categorize files. It should also be easier for anyone who wants to work with files in a particular language. Anyone can fetch these files with the API; if no language template is used, it gets a lot more messy.-- Eatcha (talk) 10:46, 18 January 2020 (UTC)

Final goal with the bot

Hi, in addition to the technical discussion above, I think we also need a meta discussion on the overall aim and expectation of the bot. I think there are 3 possible routes this can go:

1. We expect the bot to perform 100% correct edits and only approve a bot that does so
This seems unrealistic, as I now understand that the issue is not primarily with the Python language detection. It seems that language detection in general is not up to this standard yet. There are simply too many special cases. So this expectation will likely lead to a rejection of this proposal.
2. We expect the bot to perform almost 100% correct edits and only approve a bot that does so
This would mean that we should have a discussion on the cases in which the bot performs a correct detection. This would reduce the number of processed pages and could ultimately leave only a small subset to be tagged.
3. We expect the bot to perform language detection as well as possible, but it is expected that the bot will make errors.
This would mean that the bot tries to identify the language for each page (where it is missing) and that we expect users to correct this afterwards if it was incorrect.

It currently seems that the bot is working under the assumption 3. Do we feel this is the correct way? -- Schlurcher (talk) 19:25, 18 January 2020 (UTC)

Expecting that the user who uploaded the image will correct errors is in my opinion extremely optimistic because I expect that in most cases he/she will not see the watchlist or does not care. It is nice to have many corrections by the bot, but how important is it to have it all correct. I see very often that people have used English as default and apparently don’t care to change it when the description is for example in Spanish. Wouter (talk) 21:59, 18 January 2020 (UTC)
  • Can we please use Precision = TP / (TP + FP) to determine usefulness? Why not try it with the latest 1000 edits? "Perform 100% correct", "almost perform 100%" and "as best as possible" are not ideal ways to measure precision. After calculating the rate we can set a target to achieve. Try to find FPs after 22:23, 18 January 2020 (the thousandth edit counting back from the newest). -- Eatcha (talk) 04:27, 19 January 2020 (UTC)

Try your best to find errors after 22:23, 18 January 2020. We will plug FP into Precision = TP / (TP + FP) to determine Precision of model

# x marked as y ~~~~

  1. pt marked as it -- Eatcha (talk) 04:32, 19 January 2020 (UTC) -- Eatcha (talk) 04:39, 19 January 2020 (UTC)
  2. en marked as lt -- Eatcha (talk) 04:44, 19 January 2020 (UTC)
  3. en marked as id -- Eatcha (talk) 04:46, 19 January 2020 (UTC)
  4. en marked as nl -- Eatcha (talk) 04:49, 19 January 2020 (UTC)
  5. en marked as it -- Eatcha (talk) 04:50, 19 January 2020 (UTC)
  6. English and Korean mix marked Korean -- Eatcha (talk) 04:53, 19 January 2020 (UTC)
  7. en marked as tr -- Eatcha (talk) 04:56, 19 January 2020 (UTC)
  8. en marked as it -- Eatcha (talk) 05:01, 19 January 2020 (UTC)
  9. fa marked as ar 4nn1l2 (talk) 14:22, 19 January 2020 (UTC)
  10. fa marked as ar 4nn1l2 (talk) 14:22, 19 January 2020 (UTC)
  11. en marked as de --Schlurcher (talk) 18:58, 26 January 2020 (UTC)
  12. en marked as de --Schlurcher (talk) 18:58, 26 January 2020 (UTC)
  13. es? marked as en --Schlurcher (talk) 18:58, 26 January 2020 (UTC)
  14. zh? marked as en --Schlurcher (talk) 18:58, 26 January 2020 (UTC)
  15. zh? marked as en --Schlurcher (talk) 18:58, 26 January 2020 (UTC)
  16. ko mix with en marked as en --Schlurcher (talk) 18:58, 26 January 2020 (UTC)
  17. zh? mix with en marked as zh --Schlurcher (talk) 18:58, 26 January 2020 (UTC)

discussion of the probable false positives below this line

no 1: wrong, Google translate gives Portuguese
no 2 if it was “lt” Lithuania would have been Lietuva, so “en” would have been correct when the name of the location was spelled in local language
no 3 as these are names of locations this could have been many other languages. Google translate gives Hindi
no 4 as with no 3, this is the name of a location, so it could have been many languages. As USA is mentioned I prefer “en”
no 5 wrong. France recognized as English is correct but it is also French. If it was Italian it would have been La Tronche, Francia
no 6 OK
no 7 a difficult one. If it was “tr” it should have been Kuchmin Yar, Kiev, Ukrayna
no 8 the name of an automobile. It could have been many languages.
Wouter (talk) 14:06, 19 January 2020 (UTC)

As I mentioned above in a discussion with another user, I can prevent all the cases where the place name has no translation but the country is in English, like I did with Germany and Spain. I will also fix the multiple-languages problem where both languages are used in significant amounts, let's say a 70:30 ratio of zh to en. If the ratio is 90:10 or 80:20, I will mark the one with 90 and 80 respectively. Today I created a dataset of plant genera to help the bot classify these as English. As a user pointed out the problem with fa and Arabic text, I will fix this problem before the next run. My hope is to achieve more than 0.95 accuracy in the later runs. -- Eatcha (talk) 15:33, 19 January 2020 (UTC)
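The ratio rule described above could be sketched as follows, assuming some earlier step has already counted how much of the text belongs to each language (how those counts are obtained is not specified here). `choose_by_share` and the 0.8 cut-off are illustrative: 90:10 and 80:20 get the dominant language, while 70:30 counts as mixed.

```python
def choose_by_share(counts, dominance=0.8):
    """Return the dominant language, or None when the text is a real mix.

    counts maps a language code to the amount of text (e.g. characters)
    attributed to it by an earlier, unspecified step.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    lang, top = max(counts.items(), key=lambda item: item[1])
    # At or above the cut-off, tag the dominant language; below it,
    # treat the description as mixed and leave it for separate handling.
    return lang if top / total >= dominance else None
```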
I did check around 200 edits now and saw only 8 errors (potentially a few more that I cannot identify). I would say the accuracy is already around 0.95. --Schlurcher (talk) 19:01, 26 January 2020 (UTC)
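As a worked check of the figure above with the formula Precision = TP / (TP + FP): about 200 reviewed edits with 8 errors gives 192 true positives and 8 false positives. The `precision` helper is just for illustration.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no edits were reviewed")
    return true_positives / total


reviewed = 200  # roughly the number of edits checked above
errors = 8      # errors found among them (false positives)
print(f"{precision(reviewed - errors, errors):.3f}")
```

This comes out at 0.96, in line with the "around 0.95" estimate, bearing in mind that a few unidentified errors would lower it slightly.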