Adding model cards for 5 models (#6703)

* Added model cards for 4 models

Added model cards for:
- roberta-base-bulgarian
- roberta-base-bulgarian-pos
- roberta-small-bulgarian
- roberta-small-bulgarian-pos

* fixed link text

* Update README.md

* Create README.md

* removed trailing bracket

* Add language metadata

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Adam Montgomerie 2020-08-27 06:20:55 +09:00 committed by GitHub
parent 3242e4d942
commit baeba53e88
5 changed files with 138 additions and 0 deletions

@@ -0,0 +1,28 @@
# BERT-base-cased-qa-evaluator
This model takes a question-answer pair as input and outputs a value representing its prediction of whether the input is a valid question-answer pair. The model is a pretrained [BERT-base-cased](https://huggingface.co/bert-base-cased) with a sequence classification head.
## Intended uses
The QA evaluator was originally designed to be used with the [t5-base-question-generator](https://huggingface.co/iarfmoose/t5-base-question-generator) for evaluating the quality of generated questions.
The input for the QA evaluator follows the format for `BertForSequenceClassification`, but with the question and answer as the two sequences. Inputs should take the following format:
```
[CLS] <question> [SEP] <answer> [SEP]
```
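A minimal usage sketch is shown below. The repo id and the meaning of each output logit are assumptions based on the card title, not documented details.

```python
# Minimal usage sketch; the model id and the meaning of each logit are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "iarfmoose/bert-base-cased-qa-evaluator"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

question = "What is the capital of Bulgaria?"
answer = "The capital of Bulgaria is Sofia."

# Passing the two texts as a pair produces: [CLS] <question> [SEP] <answer> [SEP]
inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)  # compare the two class scores to judge the pair
```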
## Limitations and bias
The model is trained to evaluate whether a question and answer are semantically related, but it cannot determine whether an answer is actually true or correct.
## Training data
The training data was made up of question-answer pairs from the following datasets:
- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)
- [RACE](http://www.cs.cmu.edu/~glai1/data/race/)
- [CoQA](https://stanfordnlp.github.io/coqa/)
- [MSMARCO](https://microsoft.github.io/msmarco/)
## Training procedure
The question and answer were concatenated as-is 50% of the time. For the other 50%, a corruption operation was performed: either the answer was swapped for an unrelated answer, or part of the question was copied into the answer. The model was then trained to predict whether the input sequence represented an original QA pair or a corrupted input. A rough sketch of the corruption step is shown below.
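This sketch only illustrates the 50/50 scheme described above; it is not the original training script.

```python
# Illustrative sketch of the 50/50 corruption scheme; not the original code.
import random

def make_example(question, answer, all_answers):
    """Return (question, answer, label): 1 for an original pair, 0 for a corrupted one."""
    if random.random() < 0.5:
        return question, answer, 1  # keep the original QA pair
    if random.random() < 0.5:
        # swap the answer for an unrelated answer
        corrupted = random.choice([a for a in all_answers if a != answer])
    else:
        # copy part of the question into the answer
        words = question.split()
        corrupted = " ".join(words[: max(1, len(words) // 2)])
    return question, corrupted, 0
```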

@@ -0,0 +1,26 @@
---
language: bg
---
# RoBERTa-base-bulgarian-POS
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) fine-tuned for part-of-speech tagging.
## Intended uses
The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS-tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.
An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
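A minimal sketch of last-token-per-word tagging with `transformers` is shown below; the repo id is assumed from this card.

```python
# Sketch of POS tagging using the last sub-token of each word; repo id is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "iarfmoose/roberta-base-bulgarian-pos"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Това е примерно изречение."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]

# keep only the prediction for the last token of each word
last_token = {}
for pos, word_id in enumerate(enc.word_ids(0)):
    if word_id is not None:
        last_token[word_id] = pos  # later positions overwrite earlier ones

for word_id, pos in sorted(last_token.items()):
    print(word_id, model.config.id2label[int(logits[pos].argmax())])
```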
## Limitations and bias
The pretraining data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
## Training procedure
The model was trained for 5 epochs over the training set. The loss was calculated from the label prediction for the last token of each word. The model achieves 97% accuracy on the test set.

@@ -0,0 +1,29 @@
---
language: bg
---
# RoBERTa-base-bulgarian
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This is a version of [RoBERTa-base](https://huggingface.co/roberta-base) pretrained on Bulgarian text.
## Intended uses
This model can be used for cloze tasks (masked language modeling) or finetuned on other tasks in Bulgarian.
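For example, a cloze-style query can be run with the fill-mask pipeline; the repo id below is assumed from this card.

```python
# Fill-mask sketch; the repo id is an assumption based on the card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="iarfmoose/roberta-base-bulgarian")
print(fill_mask("София е <mask> на България."))  # RoBERTa uses <mask> as its mask token
```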
## Limitations and bias
The training data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
This model was trained on the following data:
- [bg_dedup from OSCAR](https://oscar-corpus.com/)
- [Newscrawl 1 million sentences 2017 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
- [Wikipedia 1 million sentences 2016 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
## Training procedure
The model was pretrained using a masked language-modeling objective with dynamic masking, as described [here](https://huggingface.co/roberta-base#preprocessing).
It was trained for 200k steps. The batch size was limited to 8 due to GPU memory limitations.
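A sketch of this setup with the `transformers` `Trainer` follows; apart from the 200k steps and batch size of 8 stated above, everything here (tokenizer source, masking probability, other hyperparameters) is an assumption, not the original training script.

```python
# Sketch of masked-LM pretraining with dynamic masking; only the step count and
# batch size come from the card, the rest is assumed.
from transformers import (AutoTokenizer, RobertaConfig, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("iarfmoose/roberta-base-bulgarian")  # assumed repo id
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# the collator re-samples masked positions every time a batch is built (dynamic masking)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-base-bulgarian",
    per_device_train_batch_size=8,
    max_steps=200_000,
)
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_bulgarian_dataset)  # dataset not shown here
# trainer.train()
```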

@@ -0,0 +1,26 @@
---
language: bg
---
# RoBERTa-small-bulgarian-POS
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-small-Bulgarian](https://huggingface.co/iarfmoose/roberta-small-bulgarian) fine-tuned for part-of-speech tagging.
## Intended uses
The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS-tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.
An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
## Limitations and bias
The pretraining data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
## Training procedure
The model was trained for 5 epochs over the training set. The loss was calculated from the label prediction for the last token of each word. The model achieves 98% accuracy on the test set.
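One way to restrict the loss to the last token of each word is to mask every other position's label; the helper below is a sketch under that assumption (-100 is the index that `transformers`' cross-entropy loss ignores).

```python
# Hypothetical label alignment: only the last sub-token of each word keeps its
# POS label; every other position is set to -100 and ignored by the loss.
def align_labels(word_ids, word_labels, ignore_index=-100):
    """word_ids: per-token word index (None for special tokens);
    word_labels: one POS label id per word."""
    labels = [ignore_index] * len(word_ids)
    last_token = {}
    for pos, word_id in enumerate(word_ids):
        if word_id is not None:
            last_token[word_id] = pos  # the last occurrence wins
    for word_id, pos in last_token.items():
        labels[pos] = word_labels[word_id]
    return labels

# e.g. word_ids = [None, 0, 0, 1, None] and word_labels = [5, 7]
# -> [-100, -100, 5, 7, -100]
```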

@@ -0,0 +1,29 @@
---
language: bg
---
# RoBERTa-small-bulgarian
The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This is a smaller version of [RoBERTa-base-bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) with only 6 hidden layers, but with similar performance.
## Intended uses
This model can be used for cloze tasks (masked language modeling) or finetuned on other tasks in Bulgarian.
## Limitations and bias
The training data is unfiltered text from the internet and may contain all sorts of biases.
## Training data
This model was trained on the following data:
- [bg_dedup from OSCAR](https://oscar-corpus.com/)
- [Newscrawl 1 million sentences 2017 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
- [Wikipedia 1 million sentences 2016 from Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/bulgarian)
## Training procedure
The model was pretrained using a masked language-modeling objective with dynamic masking, as described [here](https://huggingface.co/roberta-base#preprocessing).
It was trained for 160k steps. The batch size was limited to 8 due to GPU memory limitations.