unidic-mecab for Debian
-----------------------

Quick start:
------------

     $ echo "和布蕪は美味しいね" | mecab -Owakati
    和布蕪 は 美味しい ね

To see the other output formats available when unidic-mecab is used with
mecab, drop "-Owakati" or replace it with "-Overbose" or "-Ovchamame".

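As a rough sketch, the formats named above can be compared side by side with
a small shell function.  The function name show_formats is made up for this
example, and it assumes that "verbose" and "vchamame" are provided by the
unidic dictionary, so they may fail when another dictionary is the default.

```shell
# Run one sentence through several mecab output formats, one after another.
# show_formats is a hypothetical helper name, not part of mecab.
show_formats() {
    text=$1
    for fmt in wakati verbose vchamame; do
        echo "== -O$fmt =="
        if command -v mecab >/dev/null 2>&1; then
            # A format unknown to the active dictionary makes mecab fail;
            # report that and continue with the next format.
            echo "$text" | mecab "-O$fmt" \
                || echo "(format -O$fmt not available)"
        else
            echo "(mecab not found in PATH)"
        fi
    done
}

show_formats "和布蕪は美味しいね"
```
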
About mecab basics:
-------------------

For how to use the mecab command itself, install the mecab package and read:

  * https://taku910.github.io/mecab/

For how to manage mecab dictionaries, install the mecab-utils package and read:

  * https://taku910.github.io/mecab/learn

Please note that the dictionary utilities must be invoked by their full path:

  * /usr/lib/mecab/mecab-dict-gen
  * /usr/lib/mecab/mecab-test-gen
  * /usr/lib/mecab/mecab-dict-index
  * /usr/lib/mecab/mecab-cost-train
  * /usr/lib/mecab/mecab-system-eval

Invoke any of these with the -h option to see its command syntax.
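
If typing the full path each time is inconvenient, one possible approach is
to append the directory to PATH for the current shell session only.  This is
a minimal sketch; the variable name MECAB_LIBEXEC is made up for this example
and the tool list simply repeats the one above.

```shell
# Directory holding the dictionary utilities, as listed above.
MECAB_LIBEXEC=/usr/lib/mecab   # hypothetical variable name

# Append the directory to PATH unless it is already there.
case ":$PATH:" in
    *":$MECAB_LIBEXEC:"*) ;;                 # already present
    *) PATH="$PATH:$MECAB_LIBEXEC" ;;
esac
export PATH

# Report which utilities are actually installed and executable.
for tool in mecab-dict-gen mecab-test-gen mecab-dict-index \
            mecab-cost-train mecab-system-eval; do
    if [ -x "$MECAB_LIBEXEC/$tool" ]; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

After this, a command such as "mecab-dict-index -h" works by name in the
same session, provided the mecab-utils package is installed.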

You can find more information on mecab at:

  * https://en.wikipedia.org/wiki/MeCab
  * https://taku910.github.io/mecab/     (Document from gh-pages branch)
  * https://github.com/taku910/mecab     (source)

About unidic-mecab:
-------------------

unidic-mecab is a UTF-8 dictionary for mecab, the morphological analyzer
for Japanese text.

Please note that the unidic-mecab files are installed into:

  * /usr/share/mecab/dic/unidic (original text dictionary data)
  * /var/lib/mecab/dic/unidic (binary compiled dictionary data)

This package can coexist with other dictionaries on Debian:

  * mecab-ipadic
  * mecab-ipadic-utf8            UTF-8
  * mecab-naist-jdic             UTF-8
  * mecab-naist-jdic-eucjp
  * mecab-jumandic
  * mecab-jumandic-utf8          UTF-8
  * unidic-mecab                 UTF-8 (This package)

You can select any one of these as mecab's default dictionary with:

     $ sudo update-alternatives --config mecab-dictionary

This changes the system-wide default dictionary.
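
To try a dictionary without touching the default, mecab's -d option can point
at a compiled dictionary directory for a single invocation.  The wrapper
below is only a sketch: the function name mecab_with_dict is made up for this
example, and it merely checks that the directory exists before calling mecab.

```shell
# Run mecab with an explicitly chosen dictionary directory.
# mecab_with_dict is a hypothetical helper name, not part of mecab.
# Usage: mecab_with_dict DICDIR [mecab options...]
mecab_with_dict() {
    dic=$1
    shift
    if [ ! -d "$dic" ]; then
        echo "no such dictionary directory: $dic" >&2
        return 1
    fi
    mecab -d "$dic" "$@"
}
```

For example, to analyze one sentence with unidic while leaving the default
dictionary as it is:

     $ echo "和布蕪は美味しいね" | mecab_with_dict /var/lib/mecab/dic/unidic -Owakati

The current default can be inspected, without changing it, with
"update-alternatives --display mecab-dictionary".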

Please note the differences in base dictionary size, measured on the
uncompressed ASCII/UTF-8 text data (`du -h /usr/share/mecab/dic`):

  * ipadic:   52M
  * naist:   145M
  * juman:   207M
  * unidic: 6100M (Yes, massive and new)

Here are the basic references for unidic-mecab:

 * https://unidic.ninjal.ac.jp/
 * https://unidic.ninjal.ac.jp/glossary
 * https://unidic.ninjal.ac.jp/faq
 * https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
   UniDic version 1.3.9 ユーザーズマニュアル (User's Manual, 2008)
    - Disregard the content on ChaSen-related topics.
    - This is more useful than the newer ver. 2.1.1/2.1.2 unidic-mecab.pdf
      found under ftp://ftp.jaist.ac.jp/pub/sourceforge.jp/unidic/
      (as a pdf and inside the zip).
    - The manuals for ver. 1.3.11 (April 2009) and ver. 1.3.12 (July 2009)
      could not be located.
 * https://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-U-10-01.pdf
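   『現代日本語書き言葉均衡コーパス』形態論情報データベースの設計と実装 改訂版 (平成22年/2010)
   (Design and Implementation of the Morphological Information Database for
   the "Balanced Corpus of Contemporary Written Japanese", revised edition)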
 * https://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-D-10-05-01.pdf
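   『現代日本語書き言葉均衡コーパス』形態論情報規程集 第４版（上）(平成22年/2010)
   (Rules for Morphological Information of the "Balanced Corpus of
   Contemporary Written Japanese", 4th edition, part 1)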
 * https://pj.ninjal.ac.jp/corpus_center/bccwj/doc/report/JC-D-10-05-02.pdf
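   『現代日本語書き言葉均衡コーパス』形態論情報規程集 第４版（下）(平成22年/2010)
   (Rules for Morphological Information of the "Balanced Corpus of
   Contemporary Written Japanese", 4th edition, part 2)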

About lex.csv of unidic:
------------------------

The lex.csv file in /usr/share/mecab/dic/unidic has the following fields,
in this order:

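 * surface form (the word itself; inflected forms etc. already expanded)
 * left context ID (left-id.def)
 * right context ID (right-id.def)
 * cost (automatically computed + trained)
 * f[0]:  pos1            part of speech, major class (品詞大分類)
 * f[1]:  pos2            part of speech, medium class (品詞中分類)
 * f[2]:  pos3            part of speech, minor class (品詞小分類)
 * f[3]:  pos4            part of speech, fine class (品詞細分類)
 * f[4]:  cType           conjugation type (活用型, e.g. 「五段・サ行」)
 * f[5]:  cForm           conjugation form (活用形, e.g. 「基本形」, 「未然形」)
 * f[6]:  lForm           lemma reading (語彙素読み, katakana)
 * f[7]:  lemma           lemma (+ lemma subclass) (語彙素＋語彙素細分類);
                          mixed kanji/kana, or katakana-English
                          (e.g. イジェクト-eject)
 * f[8]:  orth, orthToken orthographic form as it appears (書字形出現形)
 * f[9]:  pron, pronToken pronunciation as it appears (発音形出現形, katakana)
 * f[10]: orthBase        orthographic base form (書字形基本形)
 * f[11]: pronBase        pronunciation base form (発音形基本形, katakana)
 * f[12]: goshu, wType    word origin class (語種: 和, 漢, 混, ...)
 * f[13]: iType           word-initial change type (語頭変化型)
 * f[14]: iForm           word-initial change form (語頭変化形)
 * f[15]: fType           word-final change type (語末変化型)
 * f[16]: fForm           word-final change form (語末変化形)
 * f[17]: iConType        word-initial change combination type (語頭変化結合型)
 * f[18]: fConType        word-final change combination type (語末変化結合型)
 * f[19]: type, lType     lemma class (語彙素類: 用, 体, ...)
 * f[20]: kana, kanaToken kana form as it appears (仮名形出現形, katakana)
 * f[21]: kanaBase        kana base form (仮名形基本形, katakana)
 * f[22]: form            word form as it appears (語形出現形; colloquial?)
 * f[23]: formBase        word form base form (語形基本形; colloquial?)
 * f[24]: aType           accent type (アクセント型, e.g. 3, "4,0")
 * f[25]: aConType        accent combination type (アクセント結合型, e.g. C1)
 * f[26]: aModType        accent modification type (アクセント修飾型, e.g. M1@1)
 * f[27]: lid             lexicon table ID (語彙表ID)
 * f[28]: lemma_id        lemma ID (語彙素ID)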
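
As an illustration of the numbering above, f[N] sits in CSV column N+5,
because four columns (surface form, left ID, right ID, cost) precede f[0].
The sketch below picks the surface form, pos1 and lemma with awk.  The
function name lex_pick is made up, and the sample line is hypothetical (its
IDs and cost are placeholders, and it is truncated after f[9]).  Plain comma
splitting is only safe for columns that come before any quoted value
containing a comma, such as an aType of "4,0" or a quoted surface form.

```shell
# Print surface form, pos1 (f[0]) and lemma (f[7]) from lex.csv-style lines.
# lex_pick is a hypothetical helper name; it reads files or standard input.
lex_pick() {
    # Column 1 = surface form, column 5 = f[0], column 12 = f[7].
    awk -F, '{ printf "%s\t%s\t%s\n", $1, $5, $12 }' "$@"
}

# Made-up sample line for illustration only, not copied from lex.csv.
sample='食べ,1,2,3,動詞,一般,*,*,下一段-バ行,連用形-一般,タベル,食べる,食べ,タベ'
printf '%s\n' "$sample" | lex_pick
```

The same numbering should also apply to mecab's own output formatting, where
%m stands for the surface form and %f[N] for feature N, assuming the features
are passed through from lex.csv unchanged:

     $ echo "和布蕪は美味しいね" | mecab -F '%m\t%f[0]\t%f[7]\n' -E 'EOS\n'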

left-id.def, right-id.def:
--------------------------

left-id.def and right-id.def are defined similarly.  See the comment lines
defining L[*] and R[*] in feature.def.

 -- Osamu Aoki <osamu@debian.org>  Sat, 23 Feb 2019 03:00:28 +0900