Metadata-Version: 2.1
Name: bunkai
Version: 1.2.0
Summary: Sentence boundary disambiguation tool for Japanese texts
Home-page: https://github.com/megagonlabs/bunkai
License: Apache-2.0
Keywords: Japanese,Sentence boundary disambiguation
Author: Yuta Hayashibe
Author-email: hayashibe@megagon.ai
Maintainer: Yuta Hayashibe
Maintainer-email: hayashibe@megagon.ai
Requires-Python: >=3.7,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: dataclasses-json (>=0.5.2,<0.6.0)
Requires-Dist: emoji (>=1.2.0)
Requires-Dist: emojis (>=0.6.0)
Requires-Dist: janome (>=0.4.1,<0.5.0)
Requires-Dist: more_itertools (>=8.6.0,<9.0.0)
Requires-Dist: numpy (>=1.16.0,<2.0.0)
Requires-Dist: seqeval (>=1.2.2,<2.0.0)
Requires-Dist: spans (>=1.1.0,<2.0.0)
Requires-Dist: torch (>=1.3.0,<2.0.0)
Requires-Dist: tqdm
Requires-Dist: transformers (>=4.3.2,<5.0.0)
Project-URL: Repository, https://github.com/megagonlabs/bunkai
Description-Content-Type: text/markdown

# Bunkai

[![PyPI version](https://badge.fury.io/py/bunkai.svg)](https://badge.fury.io/py/bunkai)
[![Python Versions](https://img.shields.io/pypi/pyversions/bunkai.svg)](https://pypi.org/project/bunkai/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

[![CircleCI](https://circleci.com/gh/megagonlabs/bunkai.svg?style=svg&circle-token=c555b8070630dfe98f0406a3892fc228b2370951)](https://app.circleci.com/pipelines/github/megagonlabs/bunkai)
[![Maintainability](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/maintainability)](https://codeclimate.com/github/megagonlabs/bunkai/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/640b02fa0164c131da10/test_coverage)](https://codeclimate.com/github/megagonlabs/bunkai/test_coverage)
[![markdownlint](https://img.shields.io/badge/markdown-lint-lightgrey)](https://github.com/markdownlint/markdownlint)
[![jsonlint](https://img.shields.io/badge/json-lint-lightgrey)](https://github.com/dmeranda/demjson)
[![yamllint](https://img.shields.io/badge/yaml-lint-lightgrey)](https://github.com/adrienverge/yamllint)

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

## Quick Start

### Install

```console
$ pip install -U bunkai
```

### Disambiguation without Models

```console
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
```

- Each input line is one document. Represent line breaks inside a document with ``▁`` (U+2581).
- Sentence boundaries in the output are marked with ``│`` (U+2502).
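
The two conventions above can be handled with a pair of small helpers. This is only a sketch of the text format described here: the names `encode_document` and `decode_sentences` are hypothetical and are not part of Bunkai.

```python
from typing import List

LINE_BREAK = "\u2581"  # "▁" stands in for a line break inside a document
BOUNDARY = "\u2502"    # "│" marks a sentence boundary in the output


def encode_document(text: str) -> str:
    """Collapse a multi-line document into the one-line input form."""
    return text.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def decode_sentences(output_line: str) -> List[str]:
    """Split one output line into sentences and restore real line breaks."""
    return [
        part.replace(LINE_BREAK, "\n")
        for part in output_line.rstrip("\n").split(BOUNDARY)
        if part  # skip empty pieces, e.g. from a trailing boundary mark
    ]
```

For instance, `encode_document` turns a two-line document into a single line joined by ``▁``, and `decode_sentences` turns an output line such as the ones shown above into a list of sentences.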

### Disambiguation for Line Breaks with a Model

To disambiguate sentence boundaries at line breaks as well, add the ``--model`` option with the path to the model.

Before the first use, set up the model. This takes some time.

```console
$ bunkai --model bunkai-model-directory --setup
```

Then run Bunkai with the model directory specified.

```console
$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
```
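
If you prefer to drive the command line from a script, the following is a minimal sketch. It assumes the `bunkai` command is on your `PATH` and that, as in the examples above, each output line corresponds to one input line. The helper names `build_command` and `run_bunkai` are hypothetical; only the ``--model`` and ``--setup`` options come from this README.

```python
import subprocess
from typing import List, Optional


def build_command(model_dir: Optional[str] = None, setup: bool = False) -> List[str]:
    """Assemble the bunkai command line from the options shown above."""
    cmd = ["bunkai"]
    if model_dir is not None:
        cmd += ["--model", model_dir]
    if setup:
        cmd.append("--setup")
    return cmd


def run_bunkai(documents: List[str], model_dir: Optional[str] = None) -> List[str]:
    """Send one document per line to bunkai on stdin and return its output lines."""
    result = subprocess.run(
        build_command(model_dir),
        input="\n".join(documents) + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if bunkai exits with an error
    )
    return result.stdout.splitlines()
```

Calling `run_bunkai(["..."], model_dir="bunkai-model-directory")` after the setup step mirrors the shell pipeline above.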

### Python Library

You can also use Bunkai as a Python library.

```python
from bunkai import Bunkai
bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます！"):
    print(sentence)
```

For more examples, see [examples](example).

## Documents

- [Documents](docs)

## References

- Yuta Hayashibe and Kensuke Mitsuzawa.
    Sentence Boundary Detection on Line Breaks in Japanese.
    Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75.
    November 2020.
    [[PDF]](https://www.aclweb.org/anthology/2020.wnut-1.10.pdf)
    [[bib]](https://www.aclweb.org/anthology/2020.wnut-1.10.bib)

## License

Apache License 2.0

