unidic-mecab for Debian
-----------------------

Quick start:
------------

 $ echo "和布蕪は美味しいね" |mecab -Owakati
 和布蕪 は 美味しい ね

You can drop "-Owakati", or replace it with "-Overbose" or "-Ovchamame",
to use unidic-mecab with mecab.

About mecab basics:
-------------------

For how to use the mecab command itself, install the mecab package and read:
 * https://taku910.github.io/mecab/

For how to manage a mecab dictionary, install the mecab-utils package and read:
 * https://taku910.github.io/mecab/learn

Please note that the dictionary utilities must be invoked with their full path:
 * /usr/lib/mecab/mecab-dict-gen
 * /usr/lib/mecab/mecab-test-gen
 * /usr/lib/mecab/mecab-dict-index
 * /usr/lib/mecab/mecab-cost-train
 * /usr/lib/mecab/mecab-system-eval

You can invoke these with the -h option to get the command syntax.

You can find more information on mecab at:
 * https://en.wikipedia.org/wiki/MeCab
 * https://taku910.github.io/mecab/ (documentation from the gh-pages branch)
 * https://github.com/taku910/mecab (source)

About unidic-mecab:
-------------------

unidic-mecab is a UTF-8 dictionary for mecab, which performs morphological
analysis of Japanese text.

Please note that the unidic-mecab files are installed into:
 * /usr/share/mecab/dic/unidic (original text dictionary data)
 * /var/lib/mecab/dic/unidic   (binary compiled dictionary data)

This package can coexist with the other mecab dictionaries on Debian:
 * mecab-ipadic
 * mecab-ipadic-utf8       UTF-8
 * mecab-naist-jdic        UTF-8
 * mecab-naist-jdic-eucjp
 * mecab-jumandic
 * mecab-jumandic-utf8     UTF-8
 * unidic-mecab            UTF-8 (this package)

You can select any one of these as mecab's default dictionary with:

 $ sudo update-alternatives --config mecab-dictionary

Please note how much the base dictionaries differ in uncompressed
ASCII/UTF-8 size (`du -h /usr/share/mecab/dic`):
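 * ipadic:   52M
 * naist:   145M
 * juman:   207M
 * unidic: 6100M (Yes, massive and new)

Here are the basic references for unidic-mecab:
 * https://unidic.ninjal.ac.jp/
 * https://unidic.ninjal.ac.jp/glossary
 * https://unidic.ninjal.ac.jp/faq
 * https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
   UniDic version 1.3.9 User's Manual (ユーザーズマニュアル, 2008)
   - Disregard the ChaSen-related content.
   - This is more useful than the newer ver. 2.1.1/2.1.2 unidic-mecab.pdf
     found under ftp://ftp.jaist.ac.jp/pub/sourceforge.jp/unidic/
     (as pdf and in zip).
   - ver. 1.3.11 (April 2009) and ver. 1.3.12 (July 2009) can't be located.
 * https://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-U-10-01.pdf
   『現代日本語書き言葉均衡コーパス』形態論情報データベースの設計と実装 改訂版
   (Design and Implementation of the Morphological Information Database of
   the "Balanced Corpus of Contemporary Written Japanese", revised edition,
   2010)
 * https://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-D-10-05-01.pdf
   『現代日本語書き言葉均衡コーパス』形態論情報規程集 第4版(上)
   (Morphological Information Rules of the "Balanced Corpus of Contemporary
   Written Japanese", 4th edition, part 1, 2010)
 * https://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-D-10-05-02.pdf
   『現代日本語書き言葉均衡コーパス』形態論情報規程集 第4版(下)
   (Morphological Information Rules of the "Balanced Corpus of Contemporary
   Written Japanese", 4th edition, part 2, 2010)

About lex.csv of unidic:
------------------------

Each line of the lex.csv file in /usr/share/mecab/dic/unidic starts with
the following fields, in this order:

 * surface form (表層形): the word itself, with inflected forms etc.
   already expanded
 * left context ID (左連接状態番号) (left-id.def)
 * right context ID (右連接状態番号) (right-id.def)
 * cost (コスト): automatically computed + trained
 * f[0]:  pos1              part of speech, major class (品詞大分類)
 * f[1]:  pos2              part of speech, middle class (品詞中分類)
 * f[2]:  pos3              part of speech, minor class (品詞小分類)
 * f[3]:  pos4              part of speech, fine class (品詞細分類)
 * f[4]:  cType             conjugation type (活用型), e.g. 「五段・サ行」
 * f[5]:  cForm             conjugation form (活用形), e.g. 「基本形」, 「未然形」
 * f[6]:  lForm             lexeme reading (語彙素読み), in katakana
 * f[7]:  lemma             lexeme (語彙素), plus lexeme subclass (語彙素細分類);
                            mixed kanji and kana, or katakana-English
                            (e.g. イジェクト-eject)
 * f[8]:  orth, orthToken   orthographic form as it appears (書字形出現形)
 * f[9]:  pron, pronToken   pronunciation as it appears (発音形出現形), in katakana
 * f[10]: orthBase          orthographic form, base form (書字形基本形)
 * f[11]: pronBase          pronunciation, base form (発音形基本形), in katakana
 * f[12]: goshu, wType      word origin class (語種): 和, 漢, 混, ...
 * f[13]: iType             word-initial change type (語頭変化型)
 * f[14]: iForm             word-initial change form (語頭変化形)
 * f[15]: fType             word-final change type (語末変化型)
 * f[16]: fForm             word-final change form (語末変化形)
 * f[17]: iConType          word-initial change combination type (語頭変化結合型)
 * f[18]: fConType          word-final change combination type (語末変化結合型)
 * f[19]: type, lType       lexeme class (語彙素類): 用, 体, ...
 * f[20]: kana, kanaToken   kana form as it appears (仮名形出現形), in katakana
 * f[21]: kanaBase          kana form, base form (仮名形基本形), in katakana
 * f[22]: form              word form as it appears (語形出現形) (colloquial?)
 * f[23]: formBase          word form, base form (語形基本形) (colloquial?)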
 * f[24]: aType             accent type (アクセント型), e.g. 3 or "4,0"
 * f[25]: aConType          accent combination type (アクセント結合型), e.g. C1
 * f[26]: aModType          accent modification type (アクセント修飾型), e.g. M1@1
 * f[27]: lid               lexicon table ID (語彙表ID)
 * f[28]: lemma_id          lexeme ID (語彙素ID)

left-id.def, right-id.def:
--------------------------

left-id.def and right-id.def are defined similarly. See the comment lines
defining L[*] and R[*] in feature.def.

 -- Osamu Aoki  Sat, 23 Feb 2019 03:00:28 +0900
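
Example: reading lex.csv fields by name
---------------------------------------

The lex.csv column list above can be turned into a small reader. The
following is only a minimal sketch, not part of this package: the names
COLUMNS and parse_lex_line are made up for illustration, the column names
are taken from the list above, and the sample line is invented rather than
copied from the real dictionary. It assumes that values containing a comma,
such as the aType value "4,0", are double-quoted in the file, which is why
it uses Python's csv module instead of a plain split on commas.

```python
import csv
import io

# Four leading columns, then f[0]..f[28], in the order listed above.
# These Python-side names are illustrative; only the order matters.
LEADING = ["surface", "left_id", "right_id", "cost"]
FEATURES = [
    "pos1", "pos2", "pos3", "pos4", "cType", "cForm", "lForm", "lemma",
    "orth", "pron", "orthBase", "pronBase", "goshu", "iType", "iForm",
    "fType", "fForm", "iConType", "fConType", "type", "kana", "kanaBase",
    "form", "formBase", "aType", "aConType", "aModType", "lid", "lemma_id",
]
COLUMNS = LEADING + FEATURES


def parse_lex_line(line):
    """Parse one lex.csv line into a dict keyed by column name."""
    # csv.reader honours double quotes, so "4,0" stays a single field.
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(COLUMNS):
        raise ValueError(
            "expected %d fields, got %d" % (len(COLUMNS), len(row))
        )
    return dict(zip(COLUMNS, row))


# Invented sample line: placeholder "*" values for f[0]..f[23], then a
# quoted accent type. It is NOT a real entry from the unidic lex.csv.
sample = 'x,1,2,3000,' + ','.join(['*'] * 24) + ',"4,0",C1,M1@1,10,20'
entry = parse_lex_line(sample)
print(entry["surface"], entry["cost"], entry["aType"], entry["lemma_id"])
```

To try this on the installed data, each line of
/usr/share/mecab/dic/unidic/lex.csv could be passed to parse_lex_line after
opening the file with encoding="utf-8". Given the dictionary size noted
above, it is probably wise to process the file line by line rather than
loading it whole.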