From f9dadcd85b6f004d8669b0432947307088bdbb7d Mon Sep 17 00:00:00 2001
From: Rohan Rajpal
Date: Wed, 2 Sep 2020 03:28:43 +0530
Subject: [PATCH] Create README.md (#6602)

---
 .../bert-base-en-hi-codemix-cased/README.md | 101 ++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100644 model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md

diff --git a/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md b/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md
new file mode 100644
index 0000000000..a407ed9daa
--- /dev/null
+++ b/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md
@@ -0,0 +1,101 @@
+---
+language:
+- hi
+- en
+tags:
+- hi
+- en
+- codemix
+license: "apache-2.0"
+datasets:
+- SAIL 2017
+metrics:
+- fscore
+- accuracy
+- precision
+- recall
+---
+
+# BERT codemixed base model for Hinglish (cased)
+
+This model was built using [lingualytics](https://github.com/lingualytics/py-lingualytics), an open-source library that supports code-mixed analytics.
+
+## Model description
+
+Input: any code-mixed Hinglish text.
+Output: a sentiment label (0 - Negative, 1 - Neutral, 2 - Positive).
+
+I took the bert-base-multilingual-cased model from Hugging Face and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
+
+## Eval results
+
+Performance of this model on the dataset:
+
+| metric     | score    |
+|------------|----------|
+| acc        | 0.55873  |
+| f1         | 0.558369 |
+| acc_and_f1 | 0.558549 |
+| precision  | 0.558075 |
+| recall     | 0.55873  |
+
+#### How to use
+
+Here is how to use this model to classify the sentiment of a given text in *PyTorch*:
+
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='pt')
+output = model(**encoded_input)
+```
+
+and here is how to extract the text features in *TensorFlow*:
+
+```python
+from transformers import BertTokenizer, TFBertModel
+
+tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+model = TFBertModel.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
+text = "Replace me by any text you'd like."
+encoded_input = tokenizer(text, return_tensors='tf')
+output = model(encoded_input)
+```
+
+#### Preprocessing
+
+I followed standard preprocessing steps:
+- removed digits
+- removed punctuation
+- removed stopwords
+- removed excess whitespace
+
+Here's the snippet:
+
+```python
+from pathlib import Path
+import pandas as pd
+from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
+from lingualytics.stopwords import hi_stopwords, en_stopwords
+from texthero.preprocessing import remove_digits, remove_whitespace
+
+root = Path('.')
+
+for file in ('test', 'train', 'validation'):
+    tochange = root / f'{file}.txt'
+    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
+    # Clean the text column, then write the split back out in place.
+    df['text'] = df['text'].pipe(remove_digits) \
+                           .pipe(remove_punctuation) \
+                           .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords)) \
+                           .pipe(remove_whitespace)
+    df.to_csv(tochange, index=None, header=None, sep='\t')
+```
+
+## Training data
+
+The dataset and its annotations are not of great quality, but this was the best dataset I could find.
+I am working on procuring my own dataset and will try to come up with a better model!
+
+## Training procedure
+
+I fine-tuned the [bert-base-multilingual-cased model](https://huggingface.co/bert-base-multilingual-cased) on the dataset described above.
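+
+Below is a minimal sketch of a comparable fine-tuning setup with the `transformers` Trainer API, assuming the tab-separated `train.txt`/`validation.txt` splits produced by the preprocessing snippet above. The actual training was done through lingualytics, so the helper class (`SAILDataset`) and the hyperparameters shown here are illustrative assumptions, not the exact procedure used.
+
+```python
+import pandas as pd
+import torch
+from torch.utils.data import Dataset
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    Trainer,
+    TrainingArguments,
+)
+
+
+class SAILDataset(Dataset):
+    """Hypothetical helper wrapping a preprocessed tab-separated split."""
+
+    def __init__(self, path, tokenizer, max_len=128):
+        # Labels are assumed to already be encoded as integers (0/1/2).
+        df = pd.read_csv(path, header=None, sep='\t', names=['text', 'label'])
+        self.encodings = tokenizer(df['text'].tolist(), truncation=True,
+                                   padding='max_length', max_length=max_len)
+        self.labels = df['label'].tolist()
+
+    def __len__(self):
+        return len(self.labels)
+
+    def __getitem__(self, idx):
+        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
+        item['labels'] = torch.tensor(self.labels[idx])
+        return item
+
+
+tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
+model = AutoModelForSequenceClassification.from_pretrained(
+    'bert-base-multilingual-cased', num_labels=3)  # 0: negative, 1: neutral, 2: positive
+
+train_dataset = SAILDataset('train.txt', tokenizer)
+eval_dataset = SAILDataset('validation.txt', tokenizer)
+
+# Illustrative hyperparameters; the released checkpoint may have used different ones.
+training_args = TrainingArguments(output_dir='./results',
+                                  num_train_epochs=3,
+                                  per_device_train_batch_size=16)
+
+trainer = Trainer(model=model, args=training_args,
+                  train_dataset=train_dataset, eval_dataset=eval_dataset)
+trainer.train()
+trainer.evaluate()
+```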