From f9dadcd85b6f004d8669b0432947307088bdbb7d Mon Sep 17 00:00:00 2001
From: Rohan Rajpal
Date: Wed, 2 Sep 2020 03:28:43 +0530
Subject: [PATCH] Create README.md (#6602)

---
 .../bert-base-en-hi-codemix-cased/README.md | 101 ++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md

diff --git a/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md b/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md
new file mode 100644
index 0000000000..a407ed9daa
--- /dev/null
+++ b/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md
@@ -0,0 +1,101 @@
+---
+language:
+- hi
+- en
+tags:
+- hi
+- en
+- codemix
+license: "apache-2.0"
+datasets:
+- SAIL 2017
+metrics:
+- fscore
+- accuracy
+- precision
+- recall
+---
+
+# BERT codemixed base model for Hinglish (cased)
+
+This model was built using [lingualytics](https://github.com/lingualytics/py-lingualytics), an open-source library that supports code-mixed analytics.
+
+## Model description
+
+Input: any code-mixed Hinglish text.
+Output: a sentiment label (0 - Negative, 1 - Neutral, 2 - Positive).
+
+I took the bert-base-multilingual-cased model from Hugging Face and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
+
+## Eval results
+
+Performance of this model on the dataset:
+
+| metric     | score    |
+|------------|----------|
+| acc        | 0.55873  |
+| f1         | 0.558369 |
+| acc_and_f1 | 0.558549 |
+| precision  | 0.558075 |
+| recall     | 0.55873  |
+
+#### How to use
+
+Here is how to use this model to classify the sentiment of a given text in *PyTorch*:
+
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+output = model(**encoded_input)
+```
+
+and here is how to extract the text features in *TensorFlow*:
+
+```python
+from transformers import BertTokenizer, TFBertModel
+
+tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+model = TFBertModel.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='tf')
+output = model(encoded_input)
+```
+
+#### Preprocessing
+
+I followed standard preprocessing steps:
+- removed digits
+- removed punctuation
+- removed stopwords
+- removed excess whitespace
+
+Here's the snippet:
+
+```python
+from pathlib import Path
+import pandas as pd
+from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
+from lingualytics.stopwords import hi_stopwords, en_stopwords
+from texthero.preprocessing import remove_digits, remove_whitespace
+
+root = Path('.')
+
+for file in ('test', 'train', 'validation'):
+    tochange = root / f'{file}.txt'
+    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
+    # Clean the text column, then write the split back out in place.
+    df['text'] = df['text'].pipe(remove_digits) \
+                           .pipe(remove_punctuation) \
+                           .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords)) \
+                           .pipe(remove_whitespace)
+    df.to_csv(tochange, index=None, header=None, sep='\t')
+```
+
+## Training data
+
+The dataset and its annotations are not of great quality, but this was the best dataset I could find.
+I am working on procuring my own dataset and will try to come up with a better model!
+
+## Training procedure
+
+I fine-tuned the [bert-base-multilingual-cased model](https://huggingface.co/bert-base-multilingual-cased) on the dataset described above.
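+
+Below is a minimal sketch of a comparable fine-tuning setup with the `transformers` Trainer API, assuming the tab-separated `train.txt`/`validation.txt` splits produced by the preprocessing snippet above. The actual training was done through lingualytics, so the helper class (`SAILDataset`) and the hyperparameters shown here are illustrative assumptions, not the exact procedure used.
+
+```python
+import pandas as pd
+import torch
+from torch.utils.data import Dataset
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    Trainer,
+    TrainingArguments,
+)
+
+
+class SAILDataset(Dataset):
+    """Hypothetical helper wrapping a preprocessed tab-separated split."""
+
+    def __init__(self, path, tokenizer, max_len=128):
+        # Labels are assumed to already be encoded as integers (0/1/2).
+        df = pd.read_csv(path, header=None, sep='\t', names=['text', 'label'])
+        self.encodings = tokenizer(df['text'].tolist(), truncation=True,
+                                   padding='max_length', max_length=max_len)
+        self.labels = df['label'].tolist()
+
+    def __len__(self):
+        return len(self.labels)
+
+    def __getitem__(self, idx):
+        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
+        item['labels'] = torch.tensor(self.labels[idx])
+        return item
+
+
+tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
+model = AutoModelForSequenceClassification.from_pretrained(
+    'bert-base-multilingual-cased', num_labels=3)  # 0: negative, 1: neutral, 2: positive
+
+train_dataset = SAILDataset('train.txt', tokenizer)
+eval_dataset = SAILDataset('validation.txt', tokenizer)
+
+# Illustrative hyperparameters; the released checkpoint may have used different ones.
+training_args = TrainingArguments(output_dir='./results',
+                                  num_train_epochs=3,
+                                  per_device_train_batch_size=16)
+
+trainer = Trainer(model=model, args=training_args,
+                  train_dataset=train_dataset, eval_dataset=eval_dataset)
+trainer.train()
+trainer.evaluate()
+```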