Tokenizers v3.0.0 (#3185)

* Renamed num_added_tokens to num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
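
For reference, a minimal usage sketch of the rename (model name here is just an example):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Before: tokenizer.num_added_tokens(pair=False)
# After the rename:
print(tokenizer.num_special_tokens_to_add(pair=False))  # 2 for BERT: [CLS] ... [SEP]
print(tokenizer.num_special_tokens_to_add(pair=True))   # 3 for BERT: [CLS] ... [SEP] ... [SEP]
```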

* Cherry-pick: partially fix space-only input without special tokens added to the output (#3091)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PreTrainedTokenizer and PreTrainedTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
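
A small illustrative check of the new property (model name is only an example):

```python
from transformers import BertTokenizer, BertTokenizerFast

assert not BertTokenizer.from_pretrained("bert-base-uncased").is_fast
assert BertTokenizerFast.from_pretrained("bert-base-uncased").is_fast
```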

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittests for fast tokenizers.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PreTrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.
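
A hedged sketch of the new call-level flag (model name and inputs are illustrative):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# add_special_tokens is now passed per call instead of at construction time
tokens = tokenizer.tokenize("Hello world", add_special_tokens=False)
ids = tokenizer.encode("Hello world", add_special_tokens=True)
enc = tokenizer.encode_plus("Hello world", add_special_tokens=True)
batch = tokenizer.batch_encode_plus(["Hello world"], add_special_tokens=True)
```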

* Ensure tokenize() correctly forwards add_special_tokens to Rust.

* Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping None values.

* Unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL tests run only if PyTorch is available.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.
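
A minimal sketch of the resulting behavior (model name is illustrative; offset mappings are only available on the fast path):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)

# Dict-style access keeps the previous behavior for model forward
input_ids = encoding["input_ids"]

# The underlying dict is reachable as .data (used by the pipelines changes below)
raw = encoding.data

# Rust-side mappings such as character offsets are now exposed
offsets = encoding["offset_mapping"]
```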

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.
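
Illustrative effect of the new default (it can still be overridden at load time):

```python
from transformers import RobertaTokenizerFast

# add_prefix_space now defaults to True on the fast tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Opting out remains possible
no_prefix = RobertaTokenizerFast.from_pretrained("roberta-base", add_prefix_space=False)
```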

* Style and format.

* Added typing on all methods for PreTrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PreTrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus
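
Sketch of the pretokenized path added by the commits above (model name and inputs are made up):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Single pretokenized sequence, and a pretokenized pair
enc = tokenizer.encode_plus(["Hello", "world"], is_pretokenized=True)
pair = tokenizer.encode_plus(["Hello", "world"], ["A", "pair"], is_pretokenized=True)

# Batch of pretokenized sequences
batch = tokenizer.batch_encode_plus(
    [["Hello", "world"], ["Another", "example"]], is_pretokenized=True
)
```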

* Update BatchEncoding method names to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken
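
For illustration, add_tokens now accepts AddedToken objects alongside plain strings (token values here are made up):

```python
from tokenizers import AddedToken
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["brand_new_token"])
tokenizer.add_tokens([AddedToken("spaced_token", lstrip=True)])
```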

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines work with BatchEncoding

* Turn off add_special_tokens on tokenize by default.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus None input exception.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content on every iteration.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
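
A standalone sketch of the normalizer composition this commit introduces (mirroring the TransfoXL diff below):

```python
from tokenizers.normalizers import Lowercase, Sequence, Strip

# Stripping is handled by a Strip normalizer at the end of the chain,
# replacing the previous encode/encode_batch overrides
normalizers = [Lowercase(), Strip(left=True, right=True)]
normalizer = Sequence(normalizers) if len(normalizers) > 1 else normalizers[0]
```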

* Bump tokenizers dependency to 0.7.0rc3

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
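
A sketch of the effect on Roberta, mirroring the updated unittest below (lstrip on the mask token absorbs the preceding space):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
enc = tokenizer.encode_plus("A, <mask> AllenNLP sentence.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['<s>', 'ĠA', ',', '<mask>', 'ĠAllen', 'N', 'LP', 'Ġsentence', '.', '</s>']
```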

* SpecialTokensMixin can use slots for faster access to underlying attributes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
Commit 96ab75b8dd by Funtowicz Morgan, 2020-04-06 22:29:15 +00:00, committed by GitHub
Parent: e52d1258e0
11 changed files with 852 additions and 579 deletions

setup.py

@@ -96,7 +96,7 @@ setup(
packages=find_packages("src"),
install_requires=[
"numpy",
"tokenizers == 0.5.2",
"tokenizers == 0.7.0rc3",
# dataclasses for Python versions that don't have it
"dataclasses;python_version<'3.7'",
# accessing files from S3 directly

src/transformers/pipelines.py

@@ -459,7 +459,7 @@ class Pipeline(_ScikitCompat):
)
# Filter out features not available on specific models
inputs = self.inputs_for_model(inputs)
# inputs = self.inputs_for_model(inputs)
return inputs
@@ -480,7 +480,7 @@ class Pipeline(_ScikitCompat):
with self.device_placement():
if self.framework == "tf":
# TODO trace model
predictions = self.model(inputs, training=False)[0]
predictions = self.model(inputs.data, training=False)[0]
else:
with torch.no_grad():
inputs = self.ensure_tensor_on_device(**inputs)
@@ -778,7 +778,7 @@ class NerPipeline(Pipeline):
# Forward
if self.framework == "tf":
entities = self.model(tokens)[0][0].numpy()
entities = self.model(tokens.data)[0][0].numpy()
input_ids = tokens["input_ids"].numpy()[0]
else:
with torch.no_grad():
@@ -1399,7 +1399,7 @@ SUPPORTED_TASKS = {
"tf": "distilbert-base-uncased-finetuned-sst-2-english",
},
"config": "distilbert-base-uncased-finetuned-sst-2-english",
"tokenizer": "distilbert-base-uncased",
"tokenizer": "distilbert-base-cased",
},
},
"ner": {

src/transformers/tokenization_bert.py

@@ -592,8 +592,6 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
self,
vocab_file,
do_lower_case=True,
do_basic_tokenize=True,
never_split=None,
unk_token="[UNK]",
sep_token="[SEP]",
pad_token="[PAD]",
@@ -601,7 +599,6 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
mask_token="[MASK]",
clean_text=True,
tokenize_chinese_chars=True,
add_special_tokens=True,
strip_accents=True,
wordpieces_prefix="##",
**kwargs
@@ -609,7 +606,6 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
super().__init__(
BertWordPieceTokenizer(
vocab_file=vocab_file,
add_special_tokens=add_special_tokens,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,

src/transformers/tokenization_roberta.py

@@ -18,9 +18,11 @@
import logging
from typing import List, Optional
from tokenizers import AddedToken
from tokenizers.processors import RobertaProcessing
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
from .tokenization_utils import PreTrainedTokenizer
logger = logging.getLogger(__name__)
@@ -259,7 +261,7 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
unk_token="<unk>",
pad_token="<pad>",
mask_token="<mask>",
add_prefix_space=False,
add_prefix_space=True,
**kwargs
):
kwargs.setdefault("pad_token", pad_token)
@@ -281,16 +283,24 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
(sep_token, self.sep_token_id), (cls_token, self.cls_token_id)
)
self.tokenizer.add_special_tokens([kwargs["mask_token"]])
# As we override the post_processor after super().__init__, the num_added_tokens computed in super() is wrong.
# We need to recompute max_len according to the newly registered post_processor to get real values.
self.max_len_single_sentence = self.max_len - self.num_added_tokens(False) # take into account special tokens
self.max_len_sentences_pair = self.max_len - self.num_added_tokens(True) # take into account special tokens
self.max_len_single_sentence = self.max_len - self.num_special_tokens_to_add(
False
) # take into account special tokens
self.max_len_sentences_pair = self.max_len - self.num_special_tokens_to_add(
True
) # take into account special tokens
logger.warning(
"RobertaTokenizerFast has an issue when working on mask language modeling "
"where it introduces an extra encoded space before the mask token."
"See https://github.com/huggingface/transformers/pull/2778 for more information."
)
@PreTrainedTokenizer.mask_token.setter
def mask_token(self, value):
if not isinstance(value, AddedToken):
value = AddedToken(value, lstrip=True)
self._mask_token = str(value)
self.tokenizer.add_special_tokens([value])
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]

src/transformers/tokenization_transfo_xl.py

@@ -24,13 +24,13 @@ import os
import pickle
import re
from collections import Counter, OrderedDict
from typing import List, Optional, Tuple, Union
from typing import Optional
import numpy as np
from tokenizers import Encoding, Tokenizer
from tokenizers import Tokenizer
from tokenizers.implementations import BaseTokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase, Sequence, unicode_normalizer_from_str
from tokenizers.normalizers import Lowercase, Sequence, Strip, unicode_normalizer_from_str
from tokenizers.pre_tokenizers import CharDelimiterSplit, WhitespaceSplit
from tokenizers.processors import BertProcessing
@@ -381,6 +381,9 @@ class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
if lowercase:
normalizer += [Lowercase()]
# Strip normalizer at the end
normalizer += [Strip(left=True, right=True)]
if len(normalizer) > 0:
tokenizer.normalizer = Sequence(normalizer) if len(normalizer) > 1 else normalizer[0]
@@ -404,14 +407,6 @@ class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
super().__init__(tokenizer, parameters)
def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
return super().encode_batch(
[seq.strip() if isinstance(seq, str) else (seq[0].strip(), seq[1].strip()) for seq in sequences]
)
def encode(self, sequence: str, pair: Optional[str] = None) -> Encoding:
return super().encode(sequence.strip(), pair.strip() if pair else pair)
class TransfoXLTokenizerFast(PreTrainedTokenizerFast):

File diff suppressed because it is too large.

tests/test_pipelines.py

@@ -64,7 +64,7 @@ TF_TEXT_CLASSIF_FINETUNED_MODELS = {
TEXT_CLASSIF_FINETUNED_MODELS = {
(
"bert-base-uncased",
"distilbert-base-cased",
"distilbert-base-uncased-finetuned-sst-2-english",
"distilbert-base-uncased-finetuned-sst-2-english",
)

tests/test_tokenization_bert.py

@@ -82,7 +82,7 @@ class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
return
tokenizer = self.get_tokenizer()
rust_tokenizer = self.get_rust_tokenizer(add_special_tokens=False)
rust_tokenizer = self.get_rust_tokenizer()
sequence = "UNwant\u00E9d,running"
@@ -91,7 +91,7 @@ class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
self.assertListEqual(tokens, rust_tokens)
ids = tokenizer.encode(sequence, add_special_tokens=False)
rust_ids = rust_tokenizer.encode(sequence)
rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False)
self.assertListEqual(ids, rust_ids)
rust_tokenizer = self.get_rust_tokenizer()

tests/test_tokenization_common.py

@@ -282,7 +282,7 @@ class TokenizerTesterMixin:
# Method is implemented (e.g. not GPT-2)
if len(attached_sequences) != 2:
self.assertEqual(tokenizer.num_added_tokens(pair=True), len(attached_sequences) - len(sequences))
self.assertEqual(tokenizer.num_special_tokens_to_add(pair=True), len(attached_sequences) - len(sequences))
def test_maximum_encoding_length_single_input(self):
tokenizer = self.get_tokenizer()
@@ -291,7 +291,7 @@ class TokenizerTesterMixin:
stride = 2
sequence = tokenizer.encode(seq_0, add_special_tokens=False)
num_added_tokens = tokenizer.num_added_tokens()
num_added_tokens = tokenizer.num_special_tokens_to_add()
total_length = len(sequence) + num_added_tokens
information = tokenizer.encode_plus(
seq_0,

tests/test_tokenization_fast.py

@@ -1,6 +1,6 @@
import unittest
import numpy as np
from collections import namedtuple
from itertools import takewhile
from tests.utils import require_torch
from transformers import (
@@ -21,117 +21,112 @@ from transformers.tokenization_roberta import RobertaTokenizerFast
from transformers.tokenization_transfo_xl import TransfoXLTokenizerFast
class FastTokenizerMatchingTest(unittest.TestCase):
NON_ENGLISH_TAGS = ["chinese", "dutch", "french", "finnish", "german", "multilingual"]
Tokenizer = namedtuple("Tokenizer", ["name", "rust_cls", "python_cls", "vocab_key", "filter"])
def filter_non_english(_: Tokenizer, pretrained_name: str):
""" Filter all the model for non-english language """
return not any([lang in pretrained_name for lang in NON_ENGLISH_TAGS])
def filter_roberta_detectors(_: Tokenizer, pretrained_name: str):
return "detector" not in pretrained_name
class CommonFastTokenizerTest(unittest.TestCase):
TOKENIZERS_CLASSES = frozenset([])
def setUp(self) -> None:
with open("tests/fixtures/sample_text.txt") as f_data:
with open("tests/fixtures/sample_text.txt", encoding="utf-8") as f_data:
self._data = f_data.read().replace("\n\n", "\n").strip()
def assert_sequence_almost_equals(self, a, b, threshold):
def test_all_tokenizers(self):
for tok_case in self.TOKENIZERS_CLASSES:
for pretrained_name in tok_case.python_cls.pretrained_vocab_files_map[tok_case.vocab_key].keys():
# Handle padding
if len(a) != len(b):
max_len = max(len(a), len(b))
# Tokenizer.filter makes it possible to filter which Tokenizer to test based on all the
# information available in Tokenizer (name, rust class, python class, vocab key name)
if tok_case.filter is None or (
tok_case.filter is not None and tok_case.filter(tok_case, pretrained_name)
):
with self.subTest("{} ({})".format(tok_case.name, pretrained_name)):
tokenizer_r = tok_case.rust_cls.from_pretrained(pretrained_name)
tokenizer_p = tok_case.python_cls.from_pretrained(pretrained_name)
# Pad with a negative number as vocab doesn't allow idx < 0
# it will be tracked as differences
if len(a) < max_len:
a += [-1] * (max_len - len(a))
self.fast_align_python(tokenizer_r, tokenizer_p)
self.fast_only(tokenizer_r)
if len(b) < max_len:
b += [-1] * (max_len - len(b))
def fast_align_python(self, tokenizer_r, tokenizer_p):
# Check is_fast is set correctly
self.assertFalse(tokenizer_p.is_fast)
self.assertTrue(tokenizer_r.is_fast)
# Convert to numpy for convenience
a_, b_ = np.array(a), np.array(b)
# Check that Rust and Python align
self.assert_tokenization_python_rust_equals(tokenizer_r, tokenizer_p)
self.assert_num_special_tokens_to_add_equal(tokenizer_r, tokenizer_p)
self.assert_max_length_equal(tokenizer_r, tokenizer_p)
self.assert_special_tokens_map_equal(tokenizer_r, tokenizer_p)
self.assert_embeded_special_tokens(tokenizer_r, tokenizer_p)
self.assert_padding(tokenizer_r, tokenizer_p)
# TODO: enable for v3.0.0
# self.assert_empty_output_no_special_tokens(tokenizer_r, tokenizer_p)
# Compute elementwise difference
inputs_diffs = a_ - b_
inputs_diff = np.count_nonzero(inputs_diffs)
self.assertLessEqual(inputs_diff / a_.shape[0], threshold)
def fast_only(self, tokenizer_r):
# Ensure None raises an error
self.assertRaises(ValueError, tokenizer_r.tokenize, None)
self.assertRaises(ValueError, tokenizer_r.encode, None)
self.assertRaises(ValueError, tokenizer_r.encode_plus, None)
self.assertRaises(ValueError, tokenizer_r.batch_encode_plus, None)
def assert_tokenization_python_rust_almost_equals(self, tokenizer_p, tokenizer_r, threshold: float):
self.assert_add_tokens(tokenizer_r)
self.assert_offsets_mapping(tokenizer_r)
self.assert_add_special_tokens(tokenizer_r)
def assert_tokenization_python_rust_equals(self, tokenizer_p, tokenizer_r):
# Ensure basic input match
input_p = tokenizer_p.encode_plus(self._data)
input_r = tokenizer_r.encode_plus(self._data)
for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
self.assert_sequence_almost_equals(input_p[key], input_r[key], threshold)
self.assertSequenceEqual(input_p[key], input_r[key])
input_pairs_p = tokenizer_p.encode_plus(self._data, self._data)
input_pairs_r = tokenizer_r.encode_plus(self._data, self._data)
for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
self.assert_sequence_almost_equals(input_pairs_p[key], input_pairs_r[key], threshold)
self.assertSequenceEqual(input_pairs_p[key], input_pairs_r[key])
# Ensure truncation match
input_p = tokenizer_p.encode_plus(self._data, max_length=512)
input_r = tokenizer_r.encode_plus(self._data, max_length=512)
for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
self.assert_sequence_almost_equals(input_p[key], input_r[key], threshold)
self.assertSequenceEqual(input_p[key], input_r[key])
# Ensure truncation with stride match
input_p = tokenizer_p.encode_plus(self._data, max_length=512, stride=3, return_overflowing_tokens=True)
input_r = tokenizer_r.encode_plus(self._data, max_length=512, stride=3, return_overflowing_tokens=True)
for key in filter(lambda x: x in ["input_ids", "token_type_ids", "attention_mask"], input_p.keys()):
self.assert_sequence_almost_equals(input_p[key], input_r[key], threshold)
self.assertSequenceEqual(input_p[key], input_r[key])
def assert_padding(self, tokenizer_r, tokenizer_p):
# Simple input
input_r = tokenizer_r.encode("This is a simple input", max_length=15, pad_to_max_length=True)
input_p = tokenizer_p.encode("This is a simple input", max_length=15, pad_to_max_length=True)
def assert_num_special_tokens_to_add_equal(self, tokenizer_r, tokenizer_p):
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_special_tokens_to_add(False), tokenizer_p.num_special_tokens_to_add(False))
self.assertEqual(tokenizer_r.num_special_tokens_to_add(True), tokenizer_p.num_special_tokens_to_add(True))
self.assertSequenceEqual(input_r, input_p)
def assert_max_length_equal(self, tokenizer_r, tokenizer_p):
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
# Simple input
input_r = tokenizer_r.encode_plus("This is a simple input", max_length=15, pad_to_max_length=True)
input_p = tokenizer_p.encode_plus("This is a simple input", max_length=15, pad_to_max_length=True)
self.assertSequenceEqual(input_r, input_p)
# Simple input
# TODO: Re-enable this test when batch_encode_plus with padding correctly handles padding
# input_r = tokenizer_r.batch_encode_plus(
# ["This is a simple input 1", "This is a simple input 2"], max_length=15, pad_to_max_length=True
# )
# input_p = tokenizer_p.batch_encode_plus(
# ["This is a simple input 1", "This is a simple input 2"], max_length=15, pad_to_max_length=True
# )
# self.assertSequenceEqual(input_r, input_p)
# Pair input
input_r = tokenizer_r.encode("This is a simple input", "This is a pair", max_length=15, pad_to_max_length=True)
input_p = tokenizer_p.encode("This is a simple input", "This is a pair", max_length=15, pad_to_max_length=True)
self.assertSequenceEqual(input_r, input_p)
# Pair input
input_r = tokenizer_r.encode_plus(
"This is a simple input", "This is a pair", max_length=15, pad_to_max_length=True
def assert_special_tokens_map_equal(self, tokenizer_r, tokenizer_p):
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(), tokenizer_r.special_tokens_map.items(),
)
input_p = tokenizer_p.encode_plus(
"This is a simple input", "This is a pair", max_length=15, pad_to_max_length=True
)
self.assertSequenceEqual(input_r, input_p)
# Pair input
# TODO: Re-enable this test when batch_encode_plus with padding correctly handles padding
# input_r = tokenizer_r.batch_encode_plus(
# ["This is a simple input 1", "This is a simple input 2"],
# ["This is a simple pair 1", "This is a simple pair 2"],
# max_length=15,
# pad_to_max_length=True,
# )
# input_p = tokenizer_p.batch_encode_plus(
# ["This is a simple input 1", "This is a simple input 2"],
# ["This is a simple pair 1", "This is a simple pair 2"],
# max_length=15,
# pad_to_max_length=True,
# )
# self.assertSequenceEqual(input_r, input_p)
def assert_add_tokens(self, tokenizer_r):
vocab_size = tokenizer_r.vocab_size
@@ -150,34 +145,34 @@ class FastTokenizerMatchingTest(unittest.TestCase):
)
self.assertEqual(len(tokenizer_r), vocab_size + 6)
def assert_offsets_mapping(self, tokenizer):
def assert_offsets_mapping(self, tokenizer_r):
text = "Wonderful no inspiration example with subtoken"
pair = "Along with an awesome pair"
# No pair
tokens_with_offsets = tokenizer.encode_plus(text, return_special_tokens_mask=True, return_offsets_mapping=True)
added_tokens = tokenizer.num_added_tokens(False)
tokens_with_offsets = tokenizer_r.encode_plus(
text, return_special_tokens_mask=True, return_offsets_mapping=True, add_special_tokens=True
)
added_tokens = tokenizer_r.num_special_tokens_to_add(False)
offsets = tokens_with_offsets["offset_mapping"]
# Assert there are the same number of tokens and offsets
self.assertEqual(len(offsets), len(tokens_with_offsets["input_ids"]))
# Assert there are only added_tokens special tokens
self.assertEqual(sum([0 if x else 1 for x in offsets]), added_tokens)
self.assertEqual(sum(tokens_with_offsets["special_tokens_mask"]), added_tokens)
# Pairs
tokens_with_offsets = tokenizer.encode_plus(
text, pair, return_special_tokens_mask=True, return_offsets_mapping=True
tokens_with_offsets = tokenizer_r.encode_plus(
text, pair, return_special_tokens_mask=True, return_offsets_mapping=True, add_special_tokens=True
)
added_tokens = tokenizer.num_added_tokens(True)
added_tokens = tokenizer_r.num_special_tokens_to_add(True)
offsets = tokens_with_offsets["offset_mapping"]
# Assert there are the same number of tokens and offsets
self.assertEqual(len(offsets), len(tokens_with_offsets["input_ids"]))
# Assert there are only added_tokens special tokens
self.assertEqual(sum([0 if x else 1 for x in offsets]), added_tokens)
self.assertEqual(sum(tokens_with_offsets["special_tokens_mask"]), added_tokens)
def assert_batch_encode_dynamic_overflowing(self, tokenizer: PreTrainedTokenizer):
@@ -258,8 +253,89 @@ class FastTokenizerMatchingTest(unittest.TestCase):
output_p = tokenizer_p.build_inputs_with_special_tokens(input_simple, input_pair)
self.assertEqual(output_p, output_r)
def assert_save_pretrained(self, tokenizer_r, tokenizer_p):
def assert_padding(self, tokenizer_r, tokenizer_p, max_length=15):
def assert_padded_input_match(input_r: list, input_p: list, max_length: int):
# Ensure we match max_length
self.assertEqual(len(input_r), max_length), self.assertEqual(len(input_p), max_length)
# Ensure the number of padded tokens is the same
padded_tokens_r = list(takewhile(lambda i: i == tokenizer_r.pad_token_id, reversed(input_r)))
padded_tokens_p = list(takewhile(lambda i: i == tokenizer_p.pad_token_id, reversed(input_p)))
self.assertSequenceEqual(padded_tokens_r, padded_tokens_p)
def assert_batch_padded_input_match(input_r: dict, input_p: dict):
for i_r in input_r.values():
self.assertEqual(len(i_r), 2), self.assertEqual(len(i_r[0]), 15), self.assertEqual(len(i_r[1]), 15)
for i_r, i_p in zip(input_r["input_ids"], input_p["input_ids"]):
assert_padded_input_match(i_r, i_p, max_length)
for i_r, i_p in zip(input_r["attention_mask"], input_p["attention_mask"]):
self.assertSequenceEqual(i_r, i_p)
# Simple input
input_r = tokenizer_r.encode("This is a simple input", max_length=max_length, pad_to_max_length=True)
input_p = tokenizer_p.encode("This is a simple input", max_length=max_length, pad_to_max_length=True)
assert_padded_input_match(input_r, input_p, max_length)
# Pair input
input_r = tokenizer_r.encode(
"This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
)
input_p = tokenizer_p.encode(
"This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
)
assert_padded_input_match(input_r, input_p, max_length)
# Simple input
input_r = tokenizer_r.encode_plus("This is a simple input", max_length=max_length, pad_to_max_length=True)
input_p = tokenizer_p.encode_plus("This is a simple input", max_length=max_length, pad_to_max_length=True)
assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)
self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
# Pair input
input_r = tokenizer_r.encode_plus(
"This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
)
input_p = tokenizer_p.encode_plus(
"This is a simple input", "This is a pair", max_length=max_length, pad_to_max_length=True
)
assert_padded_input_match(input_r["input_ids"], input_p["input_ids"], max_length)
self.assertSequenceEqual(input_r["attention_mask"], input_p["attention_mask"])
# Simple input
# TODO: Re-enable this test when batch_encode_plus with padding correctly handles padding
input_r = tokenizer_r.batch_encode_plus(
["This is a simple input 1", "This is a simple input 2"], max_length=max_length, pad_to_max_length=True
)
input_p = tokenizer_p.batch_encode_plus(
["This is a simple input 1", "This is a simple input 2"], max_length=max_length, pad_to_max_length=True
)
assert_batch_padded_input_match(input_r, input_p)
# Pair input
# TODO: Re-enable this test when batch_encode_plus with padding correctly handles padding
input_r = tokenizer_r.batch_encode_plus(
[
("This is a simple input 1", "This is a simple input 2"),
("This is a simple pair 1", "This is a simple pair 2"),
],
max_length=15,
pad_to_max_length=True,
)
input_p = tokenizer_p.batch_encode_plus(
[
("This is a simple input 1", "This is a simple input 2"),
("This is a simple pair 1", "This is a simple pair 2"),
],
max_length=15,
pad_to_max_length=True,
)
assert_batch_padded_input_match(input_r, input_p)
def assert_save_pretrained(self, tokenizer_r, tokenizer_p):
# Checks it save with the same files
self.assertSequenceEqual(tokenizer_r.save_vocabulary("."), tokenizer_p.save_vocabulary("."))
@@ -272,267 +348,178 @@ class FastTokenizerMatchingTest(unittest.TestCase):
# self.assertEqual(getattr(tokenizer_rp, key), getattr(tokenizer_pp, key))
# self.assertEqual(getattr(tokenizer_rp, key + "_id"), getattr(tokenizer_pp, key + "_id"))
def test_bert(self):
for tokenizer_name in BertTokenizer.pretrained_vocab_files_map["vocab_file"].keys():
tokenizer_p = BertTokenizer.from_pretrained(tokenizer_name)
tokenizer_r = BertTokenizerFast.from_pretrained(tokenizer_name)
def assert_embeded_special_tokens(self, tokenizer_r, tokenizer_p):
sentence = "A, <mask> AllenNLP sentence."
tokens_r = tokenizer_r.encode_plus(
sentence, add_special_tokens=True, return_attention_mask=False, return_token_type_ids=True
)
tokens_p = tokenizer_p.encode_plus(
sentence, add_special_tokens=True, return_attention_mask=False, return_token_type_ids=True
)
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_added_tokens(False), tokenizer_p.num_added_tokens(False))
self.assertEqual(tokenizer_r.num_added_tokens(True), tokenizer_p.num_added_tokens(True))
for key in tokens_p.keys():
self.assertEqual(tokens_r[key], tokens_p[key])
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
self.assertEqual(sum(tokens_r["token_type_ids"]), 0)
self.assertEqual(sum(tokens_p["token_type_ids"]), 0)
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(),
tokenizer_r.special_tokens_map.items(),
"Bert tokenizers doesn't have the same set of special_tokens",
)
tokens_r = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
tokens_p = tokenizer_p.convert_ids_to_tokens(tokens_p["input_ids"])
self.assertSequenceEqual(tokens_r, tokens_p)
# Assure tokenization overlap between python and rust impl.
self.assert_tokenization_python_rust_almost_equals(tokenizer_p, tokenizer_r, 0.0)
def assert_add_special_tokens(self, tokenizer_r):
simple_num_special_tokens_to_add = tokenizer_r.num_special_tokens_to_add(pair=False)
# pair_num_special_tokens_to_add = tokenizer_r.num_special_tokens_to_add(pair=True)
# Ensure add_tokens and add_special_tokens return the correct vocab size
self.assert_add_tokens(tokenizer_r)
for text in ["", " "]:
# tokenize()
no_special_tokens = tokenizer_r.tokenize(text, add_special_tokens=False)
with_special_tokens = tokenizer_r.tokenize(text, add_special_tokens=True)
self.assertEqual(len(no_special_tokens), len(with_special_tokens) - simple_num_special_tokens_to_add)
# Check for offsets mapping
self.assert_offsets_mapping(tokenizer_r)
# encode()
no_special_tokens = tokenizer_r.encode(text, add_special_tokens=False)
with_special_tokens = tokenizer_r.encode(text, add_special_tokens=True)
self.assertEqual(len(no_special_tokens), len(with_special_tokens) - simple_num_special_tokens_to_add)
# Check for dynamic encoding sequence handling in batch_encode_plus
self.assert_batch_encode_dynamic_overflowing(tokenizer_r)
# encode_plus()
no_special_tokens = tokenizer_r.encode_plus(text, add_special_tokens=False)
with_special_tokens = tokenizer_r.encode_plus(text, add_special_tokens=True)
for key in no_special_tokens.keys():
self.assertEqual(
len(no_special_tokens[key]), len(with_special_tokens[key]) - simple_num_special_tokens_to_add
)
# Check alignment for build_inputs_with_special_tokens
self.assert_build_inputs_with_special_tokens(tokenizer_r, tokenizer_p)
# # batch_encode_plus
no_special_tokens = tokenizer_r.batch_encode_plus([text, text], add_special_tokens=False)
with_special_tokens = tokenizer_r.batch_encode_plus([text, text], add_special_tokens=True)
for key in no_special_tokens.keys():
for i_no, i_with in zip(no_special_tokens[key], with_special_tokens[key]):
self.assertEqual(len(i_no), len(i_with) - simple_num_special_tokens_to_add)
# Check the number of returned files for save_vocabulary
self.assert_save_pretrained(tokenizer_r, tokenizer_p)
# Check for padding
self.assert_padding(tokenizer_r, tokenizer_p)
class WordPieceFastTokenizerTest(CommonFastTokenizerTest):
"""
Override all the specific methods to test WordPiece behavior
"""
TOKENIZERS_CLASSES = frozenset(
[
Tokenizer("Bert", BertTokenizerFast, BertTokenizer, "vocab_file", filter_non_english),
Tokenizer("DistilBert", DistilBertTokenizerFast, DistilBertTokenizer, "vocab_file", filter_non_english),
]
)
def fast_only(self, tokenizer_r):
super().fast_only(tokenizer_r)
self.assert_offsets_with_special_characters(tokenizer_r)
def assert_add_special_tokens(self, tokenizer_r):
super().assert_add_special_tokens(tokenizer_r)
def assert_offsets_with_special_characters(self, tokenizer_r):
sentence = "A, naïve [MASK] AllenNLP sentence."
tokens = tokenizer_r.encode_plus(
sentence,
return_attention_mask=False,
return_token_type_ids=False,
return_offsets_mapping=True,
add_special_tokens=True,
)
expected_results = [
((0, 1), "A"),
((1, 2), ","),
((3, 8), "naive"), # BERT normalizes this away
# Append MASK here after lower-casing
((16, 21), "Allen"),
((22, 24), "##NL"),
((24, 25), "##P"),
((26, 34), "sentence"),
((35, 36), "."),
]
# Check if the tokenizer is uncased
if tokenizer_r.init_kwargs.get("do_lower_case"):
expected_results = [(offset, token.lower()) for (offset, token) in expected_results]
# Append the special tokens
expected_results.insert(3, ((9, 15), "[MASK]"))
expected_results.insert(0, (None, "[CLS]"))
expected_results.append((None, "[SEP]"))
self.assertEqual([e[1] for e in expected_results], tokenizer_r.convert_ids_to_tokens(tokens["input_ids"]))
# self.assertEqual([e[0] for e in expected_results], tokens["offset_mapping"])
class RobertaFastTokenizerTest(CommonFastTokenizerTest):
TOKENIZERS_CLASSES = frozenset(
[Tokenizer("Roberta", RobertaTokenizerFast, RobertaTokenizer, "vocab_file", filter_roberta_detectors)]
)
def assert_embeded_special_tokens(self, tokenizer_r, tokenizer_p):
sentence = "A, <mask> AllenNLP sentence."
tokens_r = tokenizer_r.encode_plus(sentence, add_special_tokens=True, return_token_type_ids=True)
tokens_p = tokenizer_p.encode_plus(sentence, add_special_tokens=True, return_token_type_ids=True)
# Rust correctly handles the space before the mask while python doesn't
self.assertSequenceEqual(tokens_r["input_ids"], [0, 83, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
self.assertSequenceEqual(tokens_p["input_ids"], [0, 83, 6, 50264, 3823, 487, 21992, 3645, 4, 2])
# token_type_ids should put 0 everywhere
self.assertEquals(sum(tokens_r["token_type_ids"]), sum(tokens_p["token_type_ids"]))
# attention_mask should put 1 everywhere, so sum over length should be 1
self.assertEquals(
sum(tokens_r["attention_mask"]) / len(tokens_r["attention_mask"]),
sum(tokens_p["attention_mask"]) / len(tokens_p["attention_mask"]),
)
# Rust should have 'Ġ' before <mask> which should be left as an entire token
tokens_r = tokenizer_r.convert_ids_to_tokens(tokens_r["input_ids"])
self.assertSequenceEqual(tokens_r, ["<s>", "ĠA", ",", "<mask>", "ĠAllen", "N", "LP", "Ġsentence", ".", "</s>"])
class NoPaddingTokenFastTokenizerMatchingTest(CommonFastTokenizerTest):
TOKENIZERS_CLASSES = [
Tokenizer("OpenAI GPT", OpenAIGPTTokenizerFast, OpenAIGPTTokenizer, "vocab_file", None),
Tokenizer("GPT2", GPT2TokenizerFast, GPT2Tokenizer, "vocab_file", None),
]
def assert_padding(self, tokenizer_r, tokenizer_p, max_length=15):
# Simple input
s = "This is a simple input"
s2 = ["This is a simple input 1", "This is a simple input 2"]
p = ("This is a simple input", "This is a pair")
p2 = [
("This is a simple input 1", "This is a simple input 2"),
("This is a simple pair 1", "This is a simple pair 2"),
]
# Simple input tests
self.assertRaises(ValueError, tokenizer_r.encode, s, max_length=max_length, pad_to_max_length=True)
# Simple input
self.assertRaises(ValueError, tokenizer_r.encode_plus, s, max_length=max_length, pad_to_max_length=True)
# Simple input
self.assertRaises(ValueError, tokenizer_r.batch_encode_plus, s2, max_length=max_length, pad_to_max_length=True)
# Pair input
self.assertRaises(ValueError, tokenizer_r.encode, p, max_length=max_length, pad_to_max_length=True)
# Pair input
self.assertRaises(ValueError, tokenizer_r.encode_plus, p, max_length=max_length, pad_to_max_length=True)
# Pair input
self.assertRaises(ValueError, tokenizer_r.batch_encode_plus, p2, max_length=max_length, pad_to_max_length=True)
class TransfoXLFastTokenizerTest(NoPaddingTokenFastTokenizerMatchingTest):
TOKENIZERS_CLASSES = frozenset(
[Tokenizer("TransfoXL", TransfoXLTokenizerFast, TransfoXLTokenizer, "pretrained_vocab_file", None)]
)
@require_torch
def test_transfoxl(self):
for tokenizer_name in TransfoXLTokenizer.pretrained_vocab_files_map["pretrained_vocab_file"].keys():
tokenizer_p = TransfoXLTokenizer.from_pretrained(tokenizer_name)
tokenizer_r = TransfoXLTokenizerFast.from_pretrained(tokenizer_name)
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_added_tokens(False), tokenizer_p.num_added_tokens(False))
self.assertEqual(tokenizer_r.num_added_tokens(True), tokenizer_p.num_added_tokens(True))
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(),
tokenizer_r.special_tokens_map.items(),
"TransfoXL tokenizers doesn't have the same set of special_tokens",
)
# Assure tokenization overlap between python and rust impl.
self.assert_tokenization_python_rust_almost_equals(tokenizer_p, tokenizer_r, 0.0)
# Ensure add_tokens and add_special_tokens return the correct vocab size
self.assert_add_tokens(tokenizer_r)
# Check for offsets mapping
self.assert_offsets_mapping(tokenizer_r)
# Check for dynamic encoding sequence handling in batch_encode_plus
self.assertRaises(ValueError, self.assert_batch_encode_dynamic_overflowing, tokenizer_r)
# Check alignment for build_inputs_with_special_tokens
self.assert_build_inputs_with_special_tokens(tokenizer_r, tokenizer_p)
# Check for padding
self.assertRaises(ValueError, self.assert_padding, tokenizer_r, tokenizer_p)
# Check the number of returned files for save_vocabulary
# TransfoXL tokenizers come in a special format which is not compatible at all
# with rust tokenizers. We ensure the errors are correctly raised
tokenizer_r_files = tokenizer_r.save_pretrained(".")
self.assertSequenceEqual(
tokenizer_r_files, ["./vocab.json", "./special_tokens_map.json", "./added_tokens.json"]
)
# Check loading a Python-tokenizer save through Rust doesn't work (and the opposite)
self.assertRaises(ValueError, tokenizer_p.from_pretrained, *tokenizer_r_files)
self.assertRaises(ValueError, tokenizer_r.from_pretrained, *tokenizer_p.save_pretrained("."))
# Check loading works for Python to Python and Rust to Rust
# Issue: https://github.com/huggingface/transformers/issues/3000
# self.assertIsNotNone(tokenizer_p.__class__.from_pretrained('./'))
self.assertIsNotNone(tokenizer_r.__class__.from_pretrained("./"))
def test_distilbert(self):
for tokenizer_name in DistilBertTokenizer.pretrained_vocab_files_map["vocab_file"].keys():
tokenizer_p = DistilBertTokenizer.from_pretrained(tokenizer_name)
tokenizer_r = DistilBertTokenizerFast.from_pretrained(tokenizer_name)
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_added_tokens(False), tokenizer_p.num_added_tokens(False))
self.assertEqual(tokenizer_r.num_added_tokens(True), tokenizer_p.num_added_tokens(True))
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
# DistilBert should match 100%
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(),
tokenizer_r.special_tokens_map.items(),
"DistilBert tokenizers doesn't have the same set of special_tokens",
)
# Assure tokenization overlap between python and rust impl.
self.assert_tokenization_python_rust_almost_equals(tokenizer_p, tokenizer_r, 0.0)
# Ensure add_tokens and add_special_tokens return the correct vocab size
self.assert_add_tokens(tokenizer_r)
# Check for offsets mapping
self.assert_offsets_mapping(tokenizer_r)
# Check for dynamic encoding sequence handling in batch_encode_plus
self.assert_batch_encode_dynamic_overflowing(tokenizer_r)
# Check alignment for build_inputs_with_special_tokens
self.assert_build_inputs_with_special_tokens(tokenizer_r, tokenizer_p)
# Check the number of returned files for save_vocabulary
self.assert_save_pretrained(tokenizer_r, tokenizer_p)
# Check for padding
self.assert_padding(tokenizer_r, tokenizer_p)
def test_gpt2(self):
for tokenizer_name in GPT2Tokenizer.pretrained_vocab_files_map["vocab_file"].keys():
tokenizer_p = GPT2Tokenizer.from_pretrained(tokenizer_name)
tokenizer_r = GPT2TokenizerFast.from_pretrained(tokenizer_name)
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_added_tokens(False), tokenizer_p.num_added_tokens(False))
self.assertEqual(tokenizer_r.num_added_tokens(True), tokenizer_p.num_added_tokens(True))
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(),
tokenizer_r.special_tokens_map.items(),
"GPT2 tokenizers doesn't have the same set of special_tokens",
)
# Assure tokenization overlap between python and rust impl.
self.assert_tokenization_python_rust_almost_equals(tokenizer_p, tokenizer_r, 0.0)
# Ensure add_tokens and add_special_tokens return the correct vocab size
self.assert_add_tokens(tokenizer_r)
# Check for offsets mapping
self.assert_offsets_mapping(tokenizer_r)
# Check for dynamic encoding sequence handling in batch_encode_plus
self.assertRaises(ValueError, self.assert_batch_encode_dynamic_overflowing, tokenizer_r)
# Check alignment for build_inputs_with_special_tokens
self.assert_build_inputs_with_special_tokens(tokenizer_r, tokenizer_p)
# Check the number of returned files for save_vocabulary
self.assert_save_pretrained(tokenizer_r, tokenizer_p)
# Check for padding
self.assertRaises(ValueError, self.assert_padding, tokenizer_r, tokenizer_p)
def test_roberta(self):
for tokenizer_name in RobertaTokenizer.pretrained_vocab_files_map["vocab_file"].keys():
tokenizer_p = RobertaTokenizer.from_pretrained(tokenizer_name)
tokenizer_r = RobertaTokenizerFast.from_pretrained(tokenizer_name)
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_added_tokens(False), tokenizer_p.num_added_tokens(False))
self.assertEqual(tokenizer_r.num_added_tokens(True), tokenizer_p.num_added_tokens(True))
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(),
tokenizer_r.special_tokens_map.items(),
"Roberta tokenizers doesn't have the same set of special_tokens",
)
# Assure tokenization overlap between python and rust impl.
self.assert_tokenization_python_rust_almost_equals(tokenizer_p, tokenizer_r, 0.01)
# Ensure add_tokens and add_special_tokens return the correct vocab size
self.assert_add_tokens(tokenizer_r)
# Check for offsets mapping
self.assert_offsets_mapping(tokenizer_r)
# Check for dynamic encoding sequence handling in batch_encode_plus
self.assert_batch_encode_dynamic_overflowing(tokenizer_r)
# Check alignment for build_inputs_with_special_tokens
self.assert_build_inputs_with_special_tokens(tokenizer_r, tokenizer_p)
# Check the number of returned files for save_vocabulary
self.assert_save_pretrained(tokenizer_r, tokenizer_p)
# Check for padding
# TODO: Re-enable this test as soon as Roberta align with the python tokenizer.
# self.assert_padding(tokenizer_r, tokenizer_p)
def test_openai(self):
for tokenizer_name in OpenAIGPTTokenizer.pretrained_vocab_files_map["vocab_file"].keys():
tokenizer_p = OpenAIGPTTokenizer.from_pretrained(tokenizer_name)
tokenizer_r = OpenAIGPTTokenizerFast.from_pretrained(tokenizer_name)
# Check we have the same number of added_tokens for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.num_added_tokens(False), tokenizer_p.num_added_tokens(False))
self.assertEqual(tokenizer_r.num_added_tokens(True), tokenizer_p.num_added_tokens(True))
# Check we have the correct max_length for both pair and non-pair inputs.
self.assertEqual(tokenizer_r.max_len_single_sentence, tokenizer_p.max_len_single_sentence)
self.assertEqual(tokenizer_r.max_len_sentences_pair, tokenizer_p.max_len_sentences_pair)
# Assert the set of special tokens match.
self.assertSequenceEqual(
tokenizer_p.special_tokens_map.items(),
tokenizer_r.special_tokens_map.items(),
"GPT tokenizers doesn't have the same set of special_tokens",
)
# Assure tokenization overlap between python and rust impl.
self.assert_tokenization_python_rust_almost_equals(tokenizer_p, tokenizer_r, 0.0)
# Ensure add_tokens and add_special_tokens return the correct vocab size
self.assert_add_tokens(tokenizer_r)
# Check for offsets mapping
self.assert_offsets_mapping(tokenizer_r)
# Check for dynamic encoding sequence handling in batch_encode_plus
self.assertRaises(ValueError, self.assert_batch_encode_dynamic_overflowing, tokenizer_r)
# Check alignment for build_inputs_with_special_tokens
self.assert_build_inputs_with_special_tokens(tokenizer_r, tokenizer_p)
self.assertEqual(len(tokenizer_r.save_vocabulary(".")), len(tokenizer_p.save_vocabulary(".")))
# Check for padding
self.assertRaises(ValueError, self.assert_padding, tokenizer_r, tokenizer_p)
# Check the number of returned files for save_vocabulary
self.assert_save_pretrained(tokenizer_r, tokenizer_p)
def test_all_tokenizers(self):
super().test_all_tokenizers()

tests/test_tokenization_gpt2.py

@@ -94,7 +94,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
return
tokenizer = self.get_tokenizer()
rust_tokenizer = self.get_rust_tokenizer(add_special_tokens=False, add_prefix_space=True)
rust_tokenizer = self.get_rust_tokenizer(add_prefix_space=True)
sequence = "lower newer"
@@ -105,7 +105,7 @@ class GPT2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
# Testing conversion to ids without special tokens
ids = tokenizer.encode(sequence, add_special_tokens=False, add_prefix_space=True)
rust_ids = rust_tokenizer.encode(sequence)
rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False)
self.assertListEqual(ids, rust_ids)
# Testing conversion to ids with special tokens