Metadata-Version: 2.1
Name: Fern2
Version: 0.8.0
Summary: NLP text processing toolkit for Deep Learning
Home-page: https://github.com/Jasonsey/Fern
Author: Jason, Lin
Author-email: jason.m.lin@outlook.com
License: Apache 2.0
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft
Classifier: Operating System :: POSIX :: Linux
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: tensorflow (>=2.0.0)
Requires-Dist: tqdm (>=4.0.0)
Requires-Dist: nltk (>=3.4.5)
Requires-Dist: pymssql (>=2.1.4)
Requires-Dist: numpy (>=1.18.1)
Requires-Dist: pandas (>=1.0.3)
Requires-Dist: jieba (>=0.39)
Requires-Dist: sqlalchemy (>=1.3.19)

# FERN

>  **English Version** | [中文版](./README_ZH.md)

Fern provides a standard project structure for NLP model development. With Fern, text preprocessing, model building, and model training can be implemented quickly. These modules provide the following functions:

1. Text preprocessing: data downloader, data cleaner, data transformer and data splitter
2. Model building: model saving, loading and architecture printing
3. Model training: step/epoch training and evaluation, evaluation function setting, loss function setting, and label weight setting

Fern's main goal is to eliminate the repetitive code that accumulates across NLP projects and to reduce glue code, thereby avoiding the hard-to-track bugs that arise when data is passed between processing steps.

## INSTALL

1. Install from PyPI

   ```shell
   $ pip install Fern2
   ```

2. Install from source code

   ```shell
   $ pip install -e git+https://github.com/Jasonsey/Fern.git@develop
   ```

## TUTORIAL

This quick tutorial covers the basics of all classes. For further usage, see the docstrings of the functions in the source code.

### DATA PREPARATION

1. Data download

   ```python
   from fern.utils.data import BaseDownloader


   loader = BaseDownloader(host=config.HOST, user=config.USER, password=config.PASSWORD)
   loader.read_mssql(sql=config.SQL)
   loader.save(config.SOURCE_PATH)
   ```

2. Load the downloaded data from disk

   ```python
   loader.load(config.SOURCE_PATH)
   ```

3. Data cleaning

   ```python
   from fern.utils.data import BaseCleaner


   class DataCleaner(BaseCleaner):
       def clean_label(self, row):
           return row['LABEL']

       def clean_data(self, row):
           data = row['DATA']
           res = do_clean(data)  # do_clean: a user-defined cleaning routine
           return res


   cleaner = DataCleaner(stop_words=config.STOP_WORDS, user_words=config.USER_WORDS)
   cleaner.clean(loader.data)
   ```
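   The `clean_data` hook above delegates to a user-defined routine. As a plain-Python illustration (independent of Fern's API, with `simple_clean` a hypothetical helper), a minimal cleaner might lowercase the text, tokenize on whitespace, and drop stop words:

   ```python
   def simple_clean(text, stop_words):
       """Lowercase, tokenize on whitespace, and drop stop words."""
       tokens = text.lower().split()
       return [t for t in tokens if t not in stop_words]


   print(simple_clean('The Quick Brown Fox', {'the', 'a'}))
   # ['quick', 'brown', 'fox']
   ```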

4. Data transforming

   ```python
   import numpy as np
   from fern.utils.data import BaseTransformer


   class DataTransformer(BaseTransformer):
       def transform_label(self, label):
           res = np.zeros([1] + self.output_shape, np.float32)
           for i in range(len(str(label))):
               number = int(str(label)[i])
               res[:, i, number] = 1.0
           return res


   transformer = DataTransformer(
       data=cleaner.data,
       word_path=config.WORDS_LIBRARY,
       min_len=config.MIN_SEQ_LEN,
       max_len=config.MAX_SEQ_LEN,
       min_freq=config.MIN_WORD_FREQ,
       output_shape=config.OUTPUT_SHAPE,
       filter_data=True)
   transformer.transform(data=cleaner.data)
   transformer.save(config.TRANSFORMED_DATA)
   ```
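   The `transform_label` override above one-hot encodes each digit of a numeric label. For illustration only, the same encoding in plain Python (the version above uses numpy; `one_hot_digits` is a hypothetical name):

   ```python
   def one_hot_digits(label, num_classes=10):
       """Encode each digit of `label` as a one-hot row of length `num_classes`."""
       rows = []
       for ch in str(label):
           row = [0.0] * num_classes
           row[int(ch)] = 1.0
           rows.append(row)
       return rows


   rows = one_hot_digits(42)
   # rows[0] is one-hot at index 4; rows[1] is one-hot at index 2
   ```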

5. Data splitting

   ```python
   from fern.utils.data import BaseSplitter


   splitter = BaseSplitter(rate_val=config.RATE_VAL)
   splitter.split(transformer.data)
   splitter.save(config.SPLIT_DATA)
   ```
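   A rate-based split like the one `BaseSplitter` performs can be sketched in plain Python (a simplified illustration, not Fern's implementation; `split_by_rate` is a hypothetical helper):

   ```python
   import random


   def split_by_rate(data, rate_val, seed=0):
       """Shuffle `data` and hold out a `rate_val` fraction for validation."""
       rng = random.Random(seed)
       indices = list(range(len(data)))
       rng.shuffle(indices)
       n_val = int(len(data) * rate_val)
       val_idx = set(indices[:n_val])
       data_train = [x for i, x in enumerate(data) if i not in val_idx]
       data_val = [x for i, x in enumerate(data) if i in val_idx]
       return data_train, data_val


   data_train, data_val = split_by_rate(list(range(10)), rate_val=0.2)
   # len(data_train) == 8, len(data_val) == 2
   ```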

### MODEL SEARCH

1. Configure the list of models to be searched


   ```python
   from tensorflow.keras import layers, Model

   from fern.utils.model import BaseModel


   class TextConv1D_1(BaseModel):
       def build(self):
           inp = layers.Input(shape=(self.max_seq_len,))
           x = layers.Dense(12)(inp)
           oup = layers.Activation('softmax')(x)
           model = Model(
               inputs=inp,
               outputs=oup,
               name=self.name)
           return model


   class TextConv1D_2(BaseModel):
       def build(self):
           inp = layers.Input(shape=(self.max_seq_len,))
           x = layers.Dense(24)(inp)
           oup = layers.Activation('softmax')(x)
           model = Model(
               inputs=inp,
               outputs=oup,
               name=self.name)
           return model


   UNOPTIMIZED_MODELS = [TextConv1D_1, TextConv1D_2]
   ```

2. Searching the best model

   ```python
   import tensorflow as tf

   from fern.utils.train import BaseTrainer


   best_score = 0
   best_epoch = 0
   best_model = ''

   for model in UNOPTIMIZED_MODELS:
       tf.keras.backend.clear_session()
       my_model = model(
           output_shape=config.OUTPUT_SHAPE,
           max_seq_len=config.MAX_SEQ_LEN,
           library_len=library_len)
       trainer = BaseTrainer(
           model=my_model,
           path_data=config.SPLIT_DATA,
           lr=config.LR,
           batch_size=config.BATCH_SIZE)
       score, epoch = trainer.train(
           config.EPOCHS,
           early_stop=config.EARLY_STOP)
       if score > best_score:
           best_score = score
           best_epoch = epoch
           best_model = my_model.name

   print(f'Best Model: {best_model}, Best Score: {best_score}, Best Epoch: {best_epoch}')
   ```

### TRAINING THE BEST MODEL

```python
my_model = UNOPTIMIZED_MODELS[0](
    output_shape=config.OUTPUT_SHAPE, 
    max_seq_len=config.MAX_SEQ_LEN,
    library_len=library_len)

trainer = BaseTrainer(
    model=my_model,
    path_data=config.SPLIT_DATA,
    lr=config.LR,
    batch_size=config.BATCH_SIZE)

_ = trainer.train(config.BEST_EPOCH, mode='server')
trainer.save(config.MODEL_PATH)
```

## VARIABLE NAMING RULE

To keep naming consistent, the following conventions apply to variables whose names could otherwise diverge:

1. For data variables, variables of the same kind are named as follows:
   - `data_train`, `data_val`
   - `label_train`, `label_val`

2. For metric variables, variables of the same kind are named as follows:
   - `val_loss`, `val_acc`, `val_binary_acc`
   - `train_loss`, `train_acc`

3. For other variables, use the pattern `a_b`, where the variable belongs first to `a` and secondarily to `b`:

   - `path_dataset`

## CHANGE LOG

[CHANGE LOG](./CHANGELOG.md)



