diff --git a/model_cards/voidful/albert_chinese_base/README.md b/model_cards/voidful/albert_chinese_base/README.md
new file mode 100644
index 0000000000..67d6932592
--- /dev/null
+++ b/model_cards/voidful/albert_chinese_base/README.md
@@ -0,0 +1,44 @@
+---
+language:
+- zh
+---
+
+# albert_chinese_base
+
+This is an albert_chinese_base model from [Google's ALBERT repository](https://github.com/google-research/ALBERT),
+converted to PyTorch with Hugging Face's conversion [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+## Attention
+
+Since the albert_chinese_base model does not use sentencepiece, you have to load it with BertTokenizer instead of AlbertTokenizer;
+AlbertTokenizer cannot load the vocabulary. We can run a MaskedLM prediction to verify that this setup works.
+
+## Verification
+
+[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import BertTokenizer, AlbertForMaskedLM
+
+pretrained = 'voidful/albert_chinese_base'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+inputtext = "今天[MASK]情很好"
+
+# 103 is the id of the [MASK] token in the BERT Chinese vocabulary
+maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
+
+input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids, masked_lm_labels=input_ids)
+loss, prediction_scores = outputs[:2]
+logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
+predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+print(predicted_token, logit_prob[predicted_index])
+```
+Result: `感 0.36333346366882324`
diff --git a/model_cards/voidful/albert_chinese_large/README.md b/model_cards/voidful/albert_chinese_large/README.md
new file mode 100644
index 0000000000..aac9a4d45b
--- /dev/null
+++ b/model_cards/voidful/albert_chinese_large/README.md
@@ -0,0 +1,44 @@
+---
+language:
+- zh
+---
+
+# albert_chinese_large
+
+This is an albert_chinese_large model from [Google's ALBERT repository](https://github.com/google-research/ALBERT),
+converted to PyTorch with Hugging Face's conversion [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+## Attention
+
+Since the albert_chinese_large model does not use sentencepiece, you have to load it with BertTokenizer instead of AlbertTokenizer;
+AlbertTokenizer cannot load the vocabulary. We can run a MaskedLM prediction to verify that this setup works.
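+
+If you prefer not to hard-code the `[MASK]` id (103 in the BERT Chinese vocabulary), the mask position can also be
+looked up from the tokenizer itself; a minimal sketch:
+
+```python
+from transformers import BertTokenizer
+
+tokenizer = BertTokenizer.from_pretrained('voidful/albert_chinese_large')
+ids = tokenizer.encode("今天[MASK]情很好", add_special_tokens=True)
+# Same position that the verification example below finds via .index(103)
+maskpos = ids.index(tokenizer.convert_tokens_to_ids('[MASK]'))
+print(maskpos)
+```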
+
+## Verification
+
+[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import BertTokenizer, AlbertForMaskedLM
+
+pretrained = 'voidful/albert_chinese_large'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+inputtext = "今天[MASK]情很好"
+
+# 103 is the id of the [MASK] token in the BERT Chinese vocabulary
+maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
+
+input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids, masked_lm_labels=input_ids)
+loss, prediction_scores = outputs[:2]
+logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
+predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+print(predicted_token, logit_prob[predicted_index])
+```
+Result: `心 0.9422469735145569`
diff --git a/model_cards/voidful/albert_chinese_small/README.md b/model_cards/voidful/albert_chinese_small/README.md
new file mode 100644
index 0000000000..59c5c1fc5b
--- /dev/null
+++ b/model_cards/voidful/albert_chinese_small/README.md
@@ -0,0 +1,44 @@
+---
+language:
+- zh
+---
+
+# albert_chinese_small
+
+This is an albert_chinese_small model, converted to PyTorch from the albert_small_google_zh checkpoint of the
+[brightmart/albert_zh project](https://github.com/brightmart/albert_zh) with Hugging Face's conversion [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+## Attention
+
+Since the albert_chinese_small model does not use sentencepiece, you have to load it with BertTokenizer instead of AlbertTokenizer;
+AlbertTokenizer cannot load the vocabulary. We can run a MaskedLM prediction to verify that this setup works.
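+
+The same check can also be run through the `fill-mask` pipeline, as long as the pre-built BertTokenizer is passed in
+explicitly; this is only a sketch and assumes a `transformers` version whose pipeline accepts model and tokenizer
+objects directly:
+
+```python
+from transformers import AlbertForMaskedLM, BertTokenizer, pipeline
+
+pretrained = 'voidful/albert_chinese_small'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+print(fill_mask("今天[MASK]情很好"))
+```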
+
+## Verification
+
+[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import BertTokenizer, AlbertForMaskedLM
+
+pretrained = 'voidful/albert_chinese_small'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+inputtext = "今天[MASK]情很好"
+
+# 103 is the id of the [MASK] token in the BERT Chinese vocabulary
+maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
+
+input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids, masked_lm_labels=input_ids)
+loss, prediction_scores = outputs[:2]
+logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
+predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+print(predicted_token, logit_prob[predicted_index])
+```
+Result: `感 0.6390823125839233`
diff --git a/model_cards/voidful/albert_chinese_tiny/README.md b/model_cards/voidful/albert_chinese_tiny/README.md
new file mode 100644
index 0000000000..09fcc56818
--- /dev/null
+++ b/model_cards/voidful/albert_chinese_tiny/README.md
@@ -0,0 +1,44 @@
+---
+language:
+- zh
+---
+
+# albert_chinese_tiny
+
+This is an albert_chinese_tiny model, converted to PyTorch from the albert_tiny_google_zh checkpoint of the
+[brightmart/albert_zh project](https://github.com/brightmart/albert_zh) with Hugging Face's conversion [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+## Attention
+
+Since the albert_chinese_tiny model does not use sentencepiece, you have to load it with BertTokenizer instead of AlbertTokenizer;
+AlbertTokenizer cannot load the vocabulary. We can run a MaskedLM prediction to verify that this setup works.
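+
+The BertTokenizer pairing applies to the other heads as well, for example when you only want encoder features from
+the tiny model; a minimal sketch:
+
+```python
+import torch
+from transformers import AlbertModel, BertTokenizer
+
+pretrained = 'voidful/albert_chinese_tiny'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # not AlbertTokenizer
+model = AlbertModel.from_pretrained(pretrained)
+
+input_ids = torch.tensor(tokenizer.encode("今天心情很好", add_special_tokens=True)).unsqueeze(0)
+with torch.no_grad():
+    last_hidden_state = model(input_ids)[0]  # shape: (1, sequence_length, hidden_size)
+print(last_hidden_state.shape)
+```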
+
+## Verification
+
+[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import BertTokenizer, AlbertForMaskedLM
+
+pretrained = 'voidful/albert_chinese_tiny'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+inputtext = "今天[MASK]情很好"
+
+# 103 is the id of the [MASK] token in the BERT Chinese vocabulary
+maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
+
+input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids, masked_lm_labels=input_ids)
+loss, prediction_scores = outputs[:2]
+logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
+predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+print(predicted_token, logit_prob[predicted_index])
+```
+Result: `感 0.40312355756759644`
diff --git a/model_cards/voidful/albert_chinese_xlarge/README.md b/model_cards/voidful/albert_chinese_xlarge/README.md
new file mode 100644
index 0000000000..bd18f9ce81
--- /dev/null
+++ b/model_cards/voidful/albert_chinese_xlarge/README.md
@@ -0,0 +1,44 @@
+---
+language:
+- zh
+---
+
+# albert_chinese_xlarge
+
+This is an albert_chinese_xlarge model from [Google's ALBERT repository](https://github.com/google-research/ALBERT),
+converted to PyTorch with Hugging Face's conversion [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+## Attention
+
+Since the albert_chinese_xlarge model does not use sentencepiece, you have to load it with BertTokenizer instead of AlbertTokenizer;
+AlbertTokenizer cannot load the vocabulary. We can run a MaskedLM prediction to verify that this setup works.
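+
+Beyond the single most likely token, you can also inspect the top few candidates at the `[MASK]` position; a sketch
+using the same calls as the verification example below:
+
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import AlbertForMaskedLM, BertTokenizer
+
+pretrained = 'voidful/albert_chinese_xlarge'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+ids = tokenizer.encode("今天[MASK]情很好", add_special_tokens=True)
+maskpos = ids.index(103)  # id of [MASK] in the BERT Chinese vocabulary
+input_ids = torch.tensor(ids).unsqueeze(0)
+
+prediction_scores = model(input_ids, masked_lm_labels=input_ids)[1]
+probs = softmax(prediction_scores[0, maskpos], dim=-1)
+top_probs, top_ids = torch.topk(probs, k=5)
+for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
+    print(tokenizer.convert_ids_to_tokens([token_id])[0], prob)
+```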
+
+## Verification
+
+[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import BertTokenizer, AlbertForMaskedLM
+
+pretrained = 'voidful/albert_chinese_xlarge'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+inputtext = "今天[MASK]情很好"
+
+# 103 is the id of the [MASK] token in the BERT Chinese vocabulary
+maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
+
+input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids, masked_lm_labels=input_ids)
+loss, prediction_scores = outputs[:2]
+logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
+predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+print(predicted_token, logit_prob[predicted_index])
+```
+Result: `心 0.9942440390586853`
diff --git a/model_cards/voidful/albert_chinese_xxlarge/README.md b/model_cards/voidful/albert_chinese_xxlarge/README.md
new file mode 100644
index 0000000000..77c998311b
--- /dev/null
+++ b/model_cards/voidful/albert_chinese_xxlarge/README.md
@@ -0,0 +1,44 @@
+---
+language:
+- zh
+---
+
+# albert_chinese_xxlarge
+
+This is an albert_chinese_xxlarge model from [Google's ALBERT repository](https://github.com/google-research/ALBERT),
+converted to PyTorch with Hugging Face's conversion [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py).
+
+## Attention
+
+Since the albert_chinese_xxlarge model does not use sentencepiece, you have to load it with BertTokenizer instead of AlbertTokenizer;
+AlbertTokenizer cannot load the vocabulary. We can run a MaskedLM prediction to verify that this setup works.
+
+## Verification
+
+[Colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import BertTokenizer, AlbertForMaskedLM
+
+pretrained = 'voidful/albert_chinese_xxlarge'
+tokenizer = BertTokenizer.from_pretrained(pretrained)  # BertTokenizer, not AlbertTokenizer
+model = AlbertForMaskedLM.from_pretrained(pretrained)
+
+inputtext = "今天[MASK]情很好"
+
+# 103 is the id of the [MASK] token in the BERT Chinese vocabulary
+maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)
+
+input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids, masked_lm_labels=input_ids)
+loss, prediction_scores = outputs[:2]
+logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
+predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+print(predicted_token, logit_prob[predicted_index])
+```
+Result: `心 0.995713472366333`
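+
+Since the identical check applies to every voidful/albert_chinese_* checkpoint, it can be wrapped in a small helper;
+this is only a sketch (the helper name is ours), built from the same calls as the example above:
+
+```python
+import torch
+from torch.nn.functional import softmax
+from transformers import AlbertForMaskedLM, BertTokenizer
+
+def predict_mask(pretrained, text="今天[MASK]情很好"):
+    """Return the most likely token for [MASK] and its probability."""
+    tokenizer = BertTokenizer.from_pretrained(pretrained)  # not AlbertTokenizer
+    model = AlbertForMaskedLM.from_pretrained(pretrained)
+    ids = tokenizer.encode(text, add_special_tokens=True)
+    maskpos = ids.index(103)  # id of [MASK] in the BERT Chinese vocabulary
+    input_ids = torch.tensor(ids).unsqueeze(0)
+    prediction_scores = model(input_ids, masked_lm_labels=input_ids)[1]
+    probs = softmax(prediction_scores[0, maskpos], dim=-1)
+    predicted_index = torch.argmax(probs).item()
+    return tokenizer.convert_ids_to_tokens([predicted_index])[0], probs[predicted_index].item()
+
+print(predict_mask('voidful/albert_chinese_xxlarge'))
+```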