From 42e8fbfc51ae4990b24a3c92fa0c5d3481dfc821 Mon Sep 17 00:00:00 2001
From: Stefan Dumitrescu
Date: Tue, 19 May 2020 01:48:56 +0300
Subject: [PATCH] Added model cards for Romanian BERT models (#4437)

* Create README.md

* Create README.md

* Update README.md

* Update README.md

* Apply suggestions from code review

Co-authored-by: Julien Chaumond
---
 .../bert-base-romanian-cased-v1/README.md   | 48 +++++++++++++++++
 .../bert-base-romanian-uncased-v1/README.md | 51 +++++++++++++++++++
 2 files changed, 99 insertions(+)
 create mode 100644 model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
 create mode 100644 model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md

diff --git a/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md b/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
new file mode 100644
index 0000000000..0dc6cd867f
--- /dev/null
+++ b/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
@@ -0,0 +1,48 @@
+---
+language: romanian
+---
+
+# bert-base-romanian-cased-v1
+
+The BERT **base**, **cased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+# load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+
+# tokenize a sentence and run it through the model
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
+outputs = model(input_ids)
+
+# get the encoding
+last_hidden_states = outputs[0]  # the last hidden states are the first element of the output tuple
+```
+
+To collapse these token-level encodings into a single sentence vector, see the pooling sketch at the end of this card.
+
+### Evaluation
+
+Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
+
+The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as, at the time of writing, it was the only publicly available BERT model that covered Romanian.
+
+| Model                        |   UPOS    |   XPOS    |    NER    |    LAS    |
+|------------------------------|:---------:|:---------:|:---------:|:---------:|
+| bert-base-multilingual-cased |   97.87   |   96.16   |   84.13   |   88.04   |
+| bert-base-romanian-cased-v1  | **98.00** | **96.46** | **85.88** | **89.69** |
+
+### Corpus
+
+The model was trained on the following corpora (the statistics in the table below were computed after cleaning):
+
+| Corpus    | Lines (M) | Words (M)   | Chars (B)  | Size (GB) |
+|-----------|:---------:|:-----------:|:----------:|:---------:|
+| OPUS      | 55.05     | 635.04      | 4.045      | 3.8       |
+| OSCAR     | 33.56     | 1725.82     | 11.411     | 11        |
+| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4       |
+| **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |
+
+#### Acknowledgements
+
+- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
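+
+#### Appendix: pooling sketch (illustrative)
+
+A minimal sketch of one way to pool the encodings above into a single fixed-size sentence vector. Mean pooling over the last hidden states is an assumption made here for illustration, not something this card prescribes; the `[CLS]` vector is another common choice.
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+
+# encode a sentence as a batch of size 1
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)
+
+# no gradients are needed for feature extraction
+with torch.no_grad():
+    last_hidden_states = model(input_ids)[0]  # shape: (1, sequence_length, 768)
+
+# mean-pool the token vectors into one sentence vector (illustrative choice)
+sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, 768)
+```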
diff --git a/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md b/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
new file mode 100644
index 0000000000..dda4116559
--- /dev/null
+++ b/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
@@ -0,0 +1,51 @@
+---
+language: romanian
+---
+
+# bert-base-romanian-uncased-v1
+
+The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+# load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
+
+# tokenize a sentence and run it through the model
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
+outputs = model(input_ids)
+
+# get the encoding
+last_hidden_states = outputs[0]  # the last hidden states are the first element of the output tuple
+```
+
+Always load the tokenizer with ``do_lower_case=True``: the model was trained on lowercased text, so cased input will not match its lowercase vocabulary. A small sanity-check sketch follows at the end of this card.
+
+### Evaluation
+
+Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).
+
+The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as, at the time of writing, it was the only publicly available BERT model that covered Romanian.
+
+| Model                          |   UPOS    |   XPOS    |    NER    |    LAS    |
+|--------------------------------|:---------:|:---------:|:---------:|:---------:|
+| bert-base-multilingual-uncased |   97.65   |   95.72   |   83.91   |   87.65   |
+| bert-base-romanian-uncased-v1  | **98.18** | **96.84** | **85.26** | **89.61** |
+
+### Corpus
+
+The model was trained on the following corpora (the statistics in the table below were computed after cleaning):
+
+| Corpus    | Lines (M) | Words (M)   | Chars (B)  | Size (GB) |
+|-----------|:---------:|:-----------:|:----------:|:---------:|
+| OPUS      | 55.05     | 635.04      | 4.045      | 3.8       |
+| OSCAR     | 33.56     | 1725.82     | 11.411     | 11        |
+| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4       |
+| **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |
+
+#### Acknowledgements
+
+- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
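+
+#### Appendix: lowercasing sanity check (illustrative)
+
+A small sanity-check sketch, not part of the model's prescribed usage: with ``do_lower_case=True`` the tokenizer lowercases its input before the subword split, so differently-cased spellings of the same sentence map to identical input ids.
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
+
+# both spellings are lowercased to "acesta este un test." before
+# wordpiece tokenization, so they produce exactly the same ids
+ids_lower = tokenizer.encode("acesta este un test.", add_special_tokens=True)
+ids_mixed = tokenizer.encode("Acesta Este UN Test.", add_special_tokens=True)
+assert ids_lower == ids_mixed
+```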