Commons talk:Chinese characters decomposition

From Wikimedia Commons, the free media repository

Some suggestions

This project is really interesting; I would say it is essential for improving learning methods. I think there is nothing like this work at a collaborative level! For this reason I guess it is of encyclopedic interest. Am I wrong? Some suggestions follow. Please comment. --Artsakenos (talk) 14:52, 17 February 2011 (UTC)

If you want me to answer something, please leave a comment on my talk page - I don't check this page too often. Michelet-密是力 (talk) 11:47, 27 February 2011 (UTC)

Unicode typeset

1) In the Unicode typeset there are some slots hosting radical variants and special components (see UniHan Page).

E.g., the characters 左 (left), 右 (right) and 友 (you) all contain a hand component (ancient meaning), "𠂇", which exists in Unicode (see e.g. YellowBridge or any other etymological source; note that the character does not display in some old versions of Windows). --Artsakenos (talk) 14:52, 17 February 2011 (UTC)

Since the purpose of this work is to provide (where possible) the reason why those components were used, we should check etymological sources and exploit the whole Unicode character set.

The decomposition is not meant to be etymological. These character decompositions are intended to be purely graphic ones. If a character is said to be composed of two simpler characters, it can theoretically be drawn by superposing the corresponding two simpler characters.
  • Character decomposition reflects etymology (historical composition) most of the time, but not always; there have been historical variations. Variants are graphically different, so the decomposition should reflect that: there is not necessarily a graphical derivation between these variants.
  • Graphical etymology for interesting characters is work I have in progress on fr:wikt:Catégorie:Étymologie graphique en chinois; you may want to take a look. Most of the time, for "interesting" characters, the historical etymology has little to do with the character compositions.
  • YellowBridge is fine most of the time (for easy cases), but not always accurate, and gives "decompositions", not etymologies. Check the graphical etymology with Richard Sears' site to see the difference.
  • The 𠂇 character did not exist in the Unicode subset I used; this is why it is not in the initial file. Feel free to add a decomposition when relevant.
Michelet-密是力 (talk) 17:27, 26 February 2011 (UTC)

Unicode typeset

Notes on the Unicode ranges not used:

  • Unicode subset from 9fa6 (40870) to 9fbb (40891) is missing.
  • Unicode subset from 3400 (13312) 㐀 to 4DB5 (19893) 䶵 is missing.
  • Unicode subset from E815 (59413)  to E864 (59492)  is missing.

--Artsakenos (talk) 14:52, 17 February 2011 (UTC)

Indeed, the work was started when only the first Unihan subset existed. Feel free to add complements, of course. Michelet-密是力 (talk) 17:27, 26 February 2011 (UTC)
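For anyone who wants to check whether a given character falls in one of the ranges reported as missing, a minimal Python sketch follows. The bounds are copied from the list in this thread; the names `MISSING_RANGES` and `missing_range` are only illustrative and are not part of the project.

```python
# Ranges reported in this thread as missing from the file (inclusive bounds).
MISSING_RANGES = [
    (0x9FA6, 0x9FBB),
    (0x3400, 0x4DB5),
    (0xE815, 0xE864),
]

def missing_range(char):
    """Return the (start, end) range that contains char, or None."""
    code_point = ord(char)
    for start, end in MISSING_RANGES:
        if start <= code_point <= end:
            return (start, end)
    return None

# The decimal values quoted in the list agree with the hex bounds.
assert (0x9FA6, 0x9FBB) == (40870, 40891)
assert (0x3400, 0x4DB5) == (13312, 19893)
assert (0xE815, 0xE864) == (59413, 59492)

print(missing_range("\u3400"))  # first code point of the second range
print(missing_range("\u4e00"))  # 一, inside the range the file already covers
```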

ISO 10646 Codification for compound kind

2) There is already a complete codification reference provided by ISO 10646 (see ISO10646) to describe the compound typology (see the publication Dec. for ISO/IEC 10646 Ideographic Characters). We could refer to that standard to describe composition typology.

It already handles even three-character and other special compounds. I'm reporting it here:

Smbl       Code p. Name in ISO 10646               Cardinality     Label
吅       2FF0    IDC LEFT TO RIGHT               IDC2            A
吕       2FF1    IDC ABOVE TO BELOW              IDC2            B
罒       2FF2    IDC LEFT TO MIDDLE AND RIGHT    IDC3            K
目       2FF3    IDC ABOVE TO MIDDLE AND BELOW   IDC3            L
回       2FF4    IDC FULL SURROUND               IDC2            I
冂       2FF5    IDC SURROUND FROM ABOVE         IDC2            F
凵       2FF6    IDC SURROUND FROM BELOW         IDC2            G
匚       2FF7    IDC SURROUND FROM LEFT          IDC2            H
厂       2FF8    IDC SURROUND FROM UPPER LEFT    IDC2            D
勹       2FF9    IDC SURROUND FROM UPPER RIGHT   IDC2            C
匕       2FFA    IDC SURROUND FROM LOWER LEFT    IDC2            E
.       2FFB    IDC OVERLAID                    IDC2            J
--- Some more codes taken from the old ones
一               Graphical primitive, non composition (second character is always *)
咒               Vertical composition, the top part being a repetition.
弼               Horizontal composition of three, the third being the repetition of the first.
品               Repetition of three.
叕               Repetition of four.
冖               Vertical composition, separated by "冖".
?               Unclear, seems compound but ...

Of course, this is a pretty huge amount of work and may not be necessary, but right now it is just a proposal to follow an existing standard. --Artsakenos (talk) 14:52, 17 February 2011 (UTC)

I was aware of the standard, but disagreed with it; hence the typology I adopted. To my knowledge, there is no "three characters composition" like 罒 or 目; the composition is (nearly) always a 2+1 one. And the "surround" kind is given by the surrounding character; there is no need to state it by a separate code. And the 冖 composition is not identified (actually it is the only one to be of a true 目 kind). And... But, of course, translating from my typology to the Unicode standard is rather easy. Michelet-密是力 (talk) 17:31, 26 February 2011 (UTC)
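To illustrate how such a translation could look, here is a small Python sketch that maps the symbols from the table in this section to the corresponding Ideographic Description Characters and builds a description sequence. The symbol-to-code-point pairs are copied from the table; the function name `to_ids` and the arity check are assumptions made for the example, and the older codes (咒, 弼, 品, 叕, 冖) are deliberately left unmapped.

```python
# Symbol used in the file -> IDC code point, as listed in the table.
SYMBOL_TO_IDC = {
    "吅": 0x2FF0, "吕": 0x2FF1, "罒": 0x2FF2, "目": 0x2FF3,
    "回": 0x2FF4, "冂": 0x2FF5, "凵": 0x2FF6, "匚": 0x2FF7,
    "厂": 0x2FF8, "勹": 0x2FF9, "匕": 0x2FFA, ".": 0x2FFB,
}

# The table marks 罒 and 目 as IDC3 (three operands); all others are IDC2.
THREE_PART = {"罒", "目"}

def to_ids(kind, *parts):
    """Build an ideographic description sequence: IDC followed by its parts."""
    expected = 3 if kind in THREE_PART else 2
    if len(parts) != expected:
        raise ValueError(f"{kind} expects {expected} parts, got {len(parts)}")
    return chr(SYMBOL_TO_IDC[kind]) + "".join(parts)

# 船 as a left-to-right composition of 舟 and 㕣.
print(to_ids("吅", "舟", "㕣"))
```

Only the prefix character changes in this mapping; the parts themselves are passed through untouched.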

Pictophonetics tags

See the other users' comments about this (Decomposition rules, possible erroneous decomposition)

船       11      吅       舟       6       八口

The decomposition should be 几口, not 八口, or, referring to etymological dictionaries, 㕣 (which doesn't appear; why?), which would be further decomposed! Furthermore, this is a pictophonetic decomposition. Where the etymological decomposition does not exist, this _could_ be indicated with a simple tag. A suggestion is to exploit the etymological dictionaries, which often already provide this information. --Artsakenos (talk) 14:52, 17 February 2011 (UTC)

In that case, I could not find the 㕣 character in the Unicode set I used (it is in a more recent character set), so I put the superposition as a placeholder (and, indeed, the decomposition of that character should be 几口). Feel free to correct the decomposition. Michelet-密是力 (talk) 11:47, 27 February 2011 (UTC)

Chinese character decomposition license

What is the exact license of the Chinese characters decomposition data?

--DV 17:59, 13 December 2008 (UTC)

"By submitting text contributions, you irrevocably agree to release all rights to your text contributions under the terms of the GFDL. " - so it's GFDL. Michelet-密是力 (talk) 19:26, 16 December 2008 (UTC)
The above comment that this is a GFDL license is from 2008. I wonder if that has changed at all? The footer of the page now says: "you irrevocably agree to release your contribution under the Creative Commons Attribution-ShareAlike 3.0 license and the GFDL." -- so I take that to mean the Creative Commons Attribution-ShareAlike 3.0 license now applies as well?

Decomposition Rules

There are many ways to decompose a Chinese character; we should choose the correct one. Which one is correct? I think it is the one that shows the character's meaning and how it was created. Therefore, the best way is to refer to "Shuo Wen Jie Zi" (說文解字) and "Kangxi Zidian" (康熙字典).

For example, 雖 means "as big as a lizard", and should be decomposed into "虫" and "唯", where 虫 is the radical and 唯 gives the pronunciation. Other examples: "" should be decomposed into "臥" and "品", and "發" into "弓" and "癹".

Rules should be set to get rid of misdecompositions. --Wihwang (talk) 10:58, 4 February 2009 (UTC)

This might depend on the purpose of decomposition data. For instance, a computer program to generate stroke orders or graphical glyphs for Chinese characters would work properly for 雖 only using the graphical decomposition into 虽 and 隹. Perhaps it is necessary to distinguish, for some characters, between an etymological decomposition and a graphical one. -- Babelfish (talk) 10:42, 21 February 2010 (UTC)

Possible erroneous decompositions

I am not a specialist on this, but I happened to notice that a few of the decompositions I looked at seem not to agree with the analyses given at http://www.chineseetymology.org/ and thought this might be worth reporting:

  • 兹 is not a repetition of 玄, but rather a combination of 艹 (艸) and 幺;
  • 昌 is made up from 日 and 曰 rather than 日 twice;
  • 朋 is not derived from 月;
  • 祘 (abacus) is not described as deriving from 示 (altar).
Answers - please keep in mind that the file gives character decompositions, not etymologies!
  • 兹 : You are probably right, the repetition seems artificial; but I couldn't find an example of repeated 幺 in the first Unihan character set.
  • 昌 : You are right once again, though the difference is etymological: graphically it's hard to tell them apart.
  • 朋 : The etymology is indeed a separate one, but the character is now composed as indicated, by a repetition of 月 ; this is what is given in the file.
  • 祘 : The etymology is indeed a separate one, but the character is now a repetition of 示.
Michelet-密是力 (talk) 12:12, 27 February 2011 (UTC)

Raw Version

Is there a raw txt version of this project that can be processed directly by a parsing script, or do we need to copy/paste each line?

-- I'm trying the same, and I think it's enough to extract exactly the lines between <pre> and </pre> tags. -- Babelfish (talk) 13:57, 6 February 2010 (UTC)
-- It seems there is also another free character decomposition project; I am reporting it here. It may be good to contact the author to see whether an exchange of data is possible, in order to avoid duplicate work: http://code.google.com/p/cjklib/wiki/Decomposition --Sysko (talk) 21:34, 16 February 2010 (UTC)
=> Yes, of course, the page is designed for easy uploading. Click on the "edit" link for the table (or for the whole page), and copy the source data. Then sort the lines (alphabetically), so as to single out all section headers and comments, which can be erased; the rest is the raw txt version. BUT, if you need to correct the file, make sure the sub-section structure is respected / maintained, so that other users may have the same references. Thanks in advance, Michelet-密是力 (talk) 21:46, 1 March 2011 (UTC)
It is not possible to select the whole table by editing the Table section, since there are other sections with a level-one heading. This is one of the reasons why I propose to add a new and more comprehensive guide to the page as a subpage (such as the one in User:Artsakenos/CCD-Guide), if you agree, of course. That way there would be a page that contains just the Chinese decomposition table. (Artsakenos (talk) 11:07, 5 March 2011 (UTC))
The file should be tab-separated but it is space-separated (with a variable number of spaces); please check below for my csv version if you have problems parsing this file (talk) 5:27, 13 March 2011 (UTC))
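The approach mentioned in this thread (taking only the lines between the <pre> and </pre> tags) can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's tooling: the function name `extract_rows` is made up, and the sample text is a shortened stand-in for the real page source.

```python
import re

def extract_rows(wikitext):
    """Return the data rows found between <pre> and </pre>, split on whitespace."""
    rows = []
    for block in re.findall(r"<pre>(.*?)</pre>", wikitext, flags=re.DOTALL):
        for line in block.splitlines():
            fields = line.split()  # tolerates tabs or a variable number of spaces
            if fields:
                rows.append(fields)
    return rows

# Shortened sample; the row is the 船 line quoted elsewhere on this talk page.
sample = (
    "== Some section ==\n"
    "<pre>\n"
    "船       11      吅       舟       6       八口\n"
    "</pre>\n"
    "a comment outside the table\n"
)
print(extract_rows(sample))
```

One caveat: splitting on runs of whitespace silently drops empty fields, so column positions can shift for rows where a field is blank. With a truly tab-separated file, `line.split("\t")` would preserve them.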

Explanation please?

        備       12      吅       亻       2               萹?      10      ?       OTHB    人
        傷       13      吅       亻       2               2昜      11      ?       OOAH    人
        傻       13      吅       亻       2               <夓   11      ?       OHCE    人

Can anyone explain field7(part2?) there?

  •  ? means that this part is unreliable?
  • 2 means that 2 strokes are missing in 昜?
  • < means that 夓 has too many strokes?

217.25.222.124

The question mark (?) means that the indicated character is doubtful (unreliable). In the first line, for instance, 備 (12 strokes) is a compound character with a horizontal composition (吅); the left char is 亻 (2 strokes) and the right char looks like 萹 (there should be 10 strokes), but (?) that is doubtful. Actually, since 萹 has 12 strokes, the character is obviously incorrect in that case (戸 instead of ...?). Michelet-密是力 (talk) 15:24, 1 January 2011 (UTC)
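A short Python sketch of how a script might read the doubt marker described in this answer. Only the trailing "?" is interpreted, since that is the only marker explained here; the leading "2" and "<" from the question are left as they are. The stroke check rests on an observation about the three quoted rows (2 + 10 = 12, 2 + 11 = 13), not on a documented rule, and both function names are hypothetical.

```python
def split_doubt(field):
    """Return (text, doubtful); a trailing '?' is read as the doubt marker."""
    if field.endswith("?"):
        return field.rstrip("?"), True
    return field, False

def strokes_add_up(total, first, second):
    """Check whether the two part stroke counts sum to the character's total."""
    return total == first + second

# Right-hand part of the 備 row: flagged as doubtful.
print(split_doubt("萹?"))
# 備: 12 strokes in total, 2 for 亻, 10 expected for the right part.
print(strokes_add_up(12, 2, 10))
# Using the real stroke count of 萹 (12) exposes the inconsistency.
print(strokes_add_up(12, 2, 12))
```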

Conversion to dictionary file

First, thanks to all contributors! I created a dictionary file (in stardict format) which can be used in dictionary programs. The source code to generate the file is obtainable from my sourceforge project page. Thanks again everyone. --Benjamin-dkp (talk) 09:56, 12 May 2012 (UTC)

[edit]

Is this a list of etymological decompositions or graphical ones? I have found many decompositions classified as 冖 are etymologically wrong. For example, 螢/瑩/營 are not 炏 + 冖 + 虫/玉/呂 but 𤇾 + 虫/玉/呂, and 党/常/棠 are not 小 + 冖 + 兄/吊/呆 but 尚 + 儿/巾/木 (裳 is correctly listed as 尚 + 衣). That is clear because they have identical or similar pronunciations. In addition, there are characters that should be listed as 回 etymologically. 勝/媵/謄/騰 are not 月 + 劵/关女/誊/駦 but 朕 + 力/女/言/馬, and 旗/旚/旝 are not 方 + 𠂉其/𠂉票/𠂉會 but 㫃 + 其/票/會. In some cases http://jigen.net/ provides good information but the site mainly shows graphical decompositions. — TAKASUGI Shinji (talk) 20:34, 29 June 2012 (UTC)

The project aims at showing graphical decomposition, mainly for IT purposes. See this discussion with Michelet-密是力. Your suggestions are correct; feel free to perform updates. --(Artsakenos (talk)) 00:55, 14 September 2012 (UTC)

咒 and 弼

The classification of 咒 is not necessary, because they are actually top-bottom compositions. For example, 咒 itself is not 口 + 口 + 几 but 吅 + 几; 嬰 is not 貝 + 貝 + 女 but 賏 + 女; and 燹 is not 豕 + 豕 + 火 but 豩 + 火. The same is true of the classification of 弼 because they are outside-inside compositions. For example, 辨/辦/瓣/辮/辯 are not 辛 + 刂/力/瓜/糸/言 + 辛 but 辡 + 刂/力/瓜/糸/言, and 嚻/囂 are not 吕 + 頁 + 吕 but 㗊 + 頁. Thanks. — TAKASUGI Shinji (talk) 04:55, 30 June 2012 (UTC)

Again, I agree with your considerations. Almost all Chinese characters are composed of two existing components/characters, be it left/right, up/down, inside/outside, whatever. Feel free to commit your updates. There are possibly many reasons why most of the decompositions were made that way: they may have been produced by some sort of automatic algorithm, the Unicode table was not yet comprehensive enough to contain all root components, and so on. --(Artsakenos (talk)) 00:34, 14 September 2012 (UTC)
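The point about nested two-part compositions can be illustrated with a toy Python sketch. The table below holds only entries taken from the examples in this thread (咒 as 吅 + 几, 嬰 as 賏 + 女), plus the repetitions they imply (吅 as 口 + 口, 賏 as 貝 + 貝). The names `BINARY` and `leaves` are made up for the illustration.

```python
# Toy table: character -> (first part, second part).
BINARY = {
    "咒": ("吅", "几"),
    "吅": ("口", "口"),
    "嬰": ("賏", "女"),
    "賏": ("貝", "貝"),
}

def leaves(char, table=BINARY):
    """Expand a character recursively until no part has an entry in the table."""
    if char not in table:
        return [char]
    first, second = table[char]
    return leaves(first, table) + leaves(second, table)

print(leaves("咒"))  # ['口', '口', '几']
print(leaves("嬰"))  # ['貝', '貝', '女']
```

With purely binary entries, the three-part reading (口 + 口 + 几) is still recoverable by expansion, which is why a dedicated three-part code is not strictly needed for these cases.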

Hanji, kanji and shortcomings of unicode codepoints

Consider 直 or 将

In shinjitai, 将's graphical composition is 丬+寽 instead of 丬+夕寸. Naturally, I'd like to see both forms. What's a good way to handle it? Add a row just after the existing one? Make a new section for shinjitai entries? (Contribution by 109.120.61.158 (unsigned))

The key point is that the decomposition must be graphical, not etymological. According to the Shinjitai form of your example, the 寸 has been replaced by 爫. Also note that, where possible and where Unicode allows, it should be one character -> two components (e.g., left/right, up/down, outside/inside). In this case, AFAIK, there is no Unicode representation for the right part of 将, hence it can be left as 丬+ 夕 + 寸. I don't know whether there is an open-source mapping between Shinjitai, Kyūjitai and Simplified Chinese characters. It would indeed be interesting to start one, if not.--(Artsakenos (talk)) 11:40, 21 March 2014 (UTC)

10. KanJi codification (for easy sorting)

Is this 10. KanJi codification (for easy sorting) actually the Cangjie input method "reinvented"? If yes, don't give it such bogus names as 10. KanJi codification

-- Agree with this ! 118.21.144.199
I checked the page history and that was a really old typo! --(Artsakenos (talk)) 18:03, 29 July 2014 (UTC)

Major revision

  • The revision has been made according to my main (private) file. It has been compared to the state of the Commons file as of yesterday, and many improvements were taken from that comparison, but a large number of corrections and/or alternatives were not retained in the process.
  • Most of the revision consisted in adding character components that are not in the 一 (4e00) to 龥 (9fa5) range.
  • The stroke counts have been systematically checked and should now be OK.
  • 咒 and 弼 patterns: these classifications should be maintained for the time being, unless it can be assured that all composed characters formed on the 咒 and 弼 patterns indeed have an equivalent 吕 or 吅 construction. These patterns appear more often than would be expected given the frequency of the proposed alternatives for duplicated compounds. A character such as 燹 can be seen as 豕 + 豕 + 火 or 豩 + 火; the "best choice" cannot be based on simple graphical considerations.
  • 冖 pattern: same thing; it could be eliminated if it can be assured that all such characters have an alternative decomposition. Right now it appears as something specific, which is noted as such.
    • 螢/瑩/營 are not 炏 + 冖 + 虫/玉/呂 but 𤇾 + 虫/玉/呂: could be, but the 𤇾 character seems very artificial, and that decomposition does not seem to be helpful.
    • 党/常/棠 are not 小 + 冖 + 兄/吊/呆 but 尚 + 儿/巾/木: why not... but note that 尚 is deformed and cannot be analysed directly into 小 + 冖 + 口, so is it real progress?
  • 回 pattern has been extended to compounds of 𠆧, 仁, 攸,...

Suggestions :

  • If alternative graphical decompositions are possible, it may be more interesting to duplicate the entry than to try and choose a "correct" one (?)
  • An alternative file could be prepared to describe the decompositions in small seal script; this would account for most etymological discussions.

Michelet-密是力 (talk) 02:31, 7 August 2014 (UTC)
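As a rough illustration of the "duplicate the entry" suggestion, the Python sketch below keeps several rows per character and groups them on lookup. The row layout (character, composition kind, parts) is a simplification assumed for the example, not the file's actual column format. The two 燹 rows correspond to the two readings discussed in this section.

```python
from collections import defaultdict

# (character, composition kind, parts); a character may appear more than once.
ROWS = [
    ("燹", "咒", ("豕", "火")),  # top part repeated: 豕 + 豕 + 火
    ("燹", "吕", ("豩", "火")),  # plain top-bottom: 豩 + 火
    ("咒", "吕", ("吅", "几")),
]

def group_alternatives(rows):
    """Group rows by character so that every alternative stays available."""
    grouped = defaultdict(list)
    for char, kind, parts in rows:
        grouped[char].append((kind, parts))
    return dict(grouped)

alternatives = group_alternatives(ROWS)
for kind, parts in alternatives["燹"]:
    print(kind, parts)
```

A consumer can then pick whichever alternative suits its purpose instead of the file having to settle on a single "correct" decomposition.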

Second major revision

  • Format errors were corrected (all of them), now tabulation = 4 spaces (always)

Please update from here; I think I have no permissions to do it.

Self-contradictory introduction

The introduction to this page seems to contradict itself. In the first paragraph, it says the purpose is to provide a "purely graphical decomposition" of characters so that they could be derived automatically by superposition. However, in the next paragraph, it seems to suggest some decompositions which could not be used to derive the characters automatically by superposition - e.g. 雖 into 虫 and 唯. It's not made entirely clear whether these are meant to be good or bad examples of what should go into the file - however, it is clear (to me, at least) that a computer program would produce an incorrect result by superposing these two characters. Then it says 'the best way is to refer to "Shuo Wen Jie Zi" (說文解字) and "Kangxi Zidian" (康熙字典)'. Surely, doing this would be virtually guaranteed to produce thousands of decompositions that could not be used to derive their characters automatically - so is a terrible piece of advice for someone editing this file.

As it stands, I get the feeling the first and second paragraphs were written by people with distinctly different aims.

Perhaps there should indeed also be a database for character etymologies. However, in that case, I think there is still also a need for a database that allows a computer program to determine the graphical structure of a character without having to resort to image processing. And, in order for both databases to remain useful, the two should be kept entirely separate, without the possibility of an editor getting the wrong idea about which one they are editing. — Preceding unsigned comment added by Spacemartin (talk • contribs) 12:57, 09 November 2015 (UTC)

-Spacemartin (talk) 12:58, 9 November 2015 (UTC)

Any idea about this character ?

Is it possible to know what '??' is?

       ??      8       吅       禾       5               勺       3                       禾