[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970)
* WIP refactoring pipeline tests - switching to fast tokenizers
* fix dialog pipeline and fill-mask
* refactoring pipeline tests backbone
* make large tests slow
* fix tests (tf Bart inactive for now)
* fix doc...
* clean up for merge
* fixing tests - remove bart from summarization until there is TF
* fix quality and RAG
* Add new translation pipeline tests - fix JAX tests
* only slow for dialog
* Fixing the missing TF-BART imports in modeling_tf_auto
* spin out pipeline tests in separate CI job
* adding pipeline test to CI YAML
* add slow pipeline tests
* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests
* Update src/transformers/tokenization_utils_base.py
  Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* Update src/transformers/pipelines.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/pipelines.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/testing_utils.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add require_torch and require_tf in is_pt_tf_cross_test

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This commit is contained in:
parent 88b3a91e61
commit 3a40cdf58d
@@ -84,7 +84,7 @@ jobs:
key: v0.3-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ --cov --durations=0 | tee output.txt
- run: RUN_PT_TF_CROSS_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pt_tf_cross_test --cov --durations=0 | tee output.txt
- run: codecov
- store_artifacts:
path: ~/transformers/output.txt
@@ -164,6 +164,56 @@ jobs:
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_tests_pipelines_torch:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.3-torch-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,torch,testing]
- save_cache:
key: v0.3-torch-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pipeline_test | tee output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_tests_pipelines_tf:
working_directory: ~/transformers
docker:
- image: circleci/python:3.7
environment:
OMP_NUM_THREADS: 1
resource_class: xlarge
parallelism: 1
steps:
- checkout
- restore_cache:
keys:
- v0.3-tf-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install git+https://github.com/huggingface/datasets
- run: pip install .[sklearn,tf-cpu,testing]
- save_cache:
key: v0.3-tf-{{ checksum "setup.py" }}
paths:
- '~/.cache/pip'
- run: RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pipeline_test | tee output.txt
- store_artifacts:
path: ~/transformers/output.txt
destination: test_output.txt
run_tests_custom_tokenizers:
working_directory: ~/transformers
docker:

@@ -331,6 +381,8 @@ workflows:
- run_tests_torch
- run_tests_tf
- run_tests_flax
- run_tests_pipelines_torch
- run_tests_pipelines_tf
- build_doc
- deploy_doc: *workflow_filters
tpu_testing_jobs:
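For reference, both new CircleCI jobs gate the pipeline suite behind the RUN_PIPELINE_TESTS flag and the is_pipeline_test pytest marker, so the equivalent local invocation (the -n 8 worker count is just the CI's xlarge resource-class setting, adjust for your machine) is:

    RUN_PIPELINE_TESTS=1 python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ -m is_pipeline_test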
@ -16,52 +16,52 @@ jobs:
|
|||
run_tests_torch_and_tf_gpu:
|
||||
runs-on: [self-hosted, single-gpu]
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
- name: Python version
|
||||
run: |
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Current dir
|
||||
run: pwd
|
||||
- run: nvidia-smi
|
||||
- uses: actions/checkout@v2
|
||||
- name: Python version
|
||||
run: |
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Current dir
|
||||
run: pwd
|
||||
- run: nvidia-smi
|
||||
|
||||
- name: Loading cache.
|
||||
uses: actions/cache@v2
|
||||
id: cache
|
||||
with:
|
||||
path: .env
|
||||
key: v0-tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
|
||||
- name: Loading cache.
|
||||
uses: actions/cache@v2
|
||||
id: cache
|
||||
with:
|
||||
path: .env
|
||||
key: v0-tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
|
||||
|
||||
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
|
||||
run: |
|
||||
python -m venv .env
|
||||
source .env/bin/activate
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install --upgrade pip
|
||||
pip install torch!=1.6.0
|
||||
pip install .[sklearn,testing,onnxruntime]
|
||||
pip install git+https://github.com/huggingface/datasets
|
||||
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
|
||||
run: |
|
||||
python -m venv .env
|
||||
source .env/bin/activate
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install --upgrade pip
|
||||
pip install torch!=1.6.0
|
||||
pip install .[sklearn,testing,onnxruntime]
|
||||
pip install git+https://github.com/huggingface/datasets
|
||||
|
||||
- name: Are GPUs recognized by our DL frameworks
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
|
||||
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
|
||||
- name: Are GPUs recognized by our DL frameworks
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
|
||||
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
|
||||
|
||||
- name: Run all non-slow tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
# TF_GPU_MEMORY_LIMIT: 4096
|
||||
OMP_NUM_THREADS: 1
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -m pytest -n 2 --dist=loadfile -s ./tests/
|
||||
- name: Run all non-slow tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
# TF_GPU_MEMORY_LIMIT: 4096
|
||||
OMP_NUM_THREADS: 1
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -m pytest -n 2 --dist=loadfile -s ./tests/
|
||||
|
||||
run_tests_torch_and_tf_multiple_gpu:
|
||||
runs-on: [self-hosted, multi-gpu]
|
||||
|
|
|
@ -12,64 +12,75 @@ jobs:
|
|||
run_all_tests_torch_and_tf_gpu:
|
||||
runs-on: [self-hosted, single-gpu]
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
- uses: actions/checkout@v2
|
||||
|
||||
- name: Loading cache.
|
||||
uses: actions/cache@v2
|
||||
id: cache
|
||||
with:
|
||||
path: .env
|
||||
key: v0-slow_tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
|
||||
- name: Loading cache.
|
||||
uses: actions/cache@v2
|
||||
id: cache
|
||||
with:
|
||||
path: .env
|
||||
key: v0-slow_tests_tf_torch_gpu-${{ hashFiles('setup.py') }}
|
||||
|
||||
- name: Python version
|
||||
run: |
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Current dir
|
||||
run: pwd
|
||||
- run: nvidia-smi
|
||||
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
|
||||
if: steps.cache.outputs.cache-hit != 'true'
|
||||
run: |
|
||||
python -m venv .env
|
||||
source .env/bin/activate
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install --upgrade pip
|
||||
pip install torch!=1.6.0
|
||||
pip install .[sklearn,testing,onnxruntime]
|
||||
pip install git+https://github.com/huggingface/datasets
|
||||
- name: Python version
|
||||
run: |
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Current dir
|
||||
run: pwd
|
||||
- run: nvidia-smi
|
||||
- name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
|
||||
if: steps.cache.outputs.cache-hit != 'true'
|
||||
run: |
|
||||
python -m venv .env
|
||||
source .env/bin/activate
|
||||
which python
|
||||
python --version
|
||||
pip --version
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install --upgrade pip
|
||||
pip install torch!=1.6.0
|
||||
pip install .[sklearn,testing,onnxruntime]
|
||||
pip install git+https://github.com/huggingface/datasets
|
||||
|
||||
- name: Are GPUs recognized by our DL frameworks
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
|
||||
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
|
||||
- name: Are GPUs recognized by our DL frameworks
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
|
||||
python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
|
||||
|
||||
|
||||
- name: Run all tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
OMP_NUM_THREADS: 1
|
||||
RUN_SLOW: yes
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=50
|
||||
- name: Run all tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
OMP_NUM_THREADS: 1
|
||||
RUN_SLOW: yes
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=50
|
||||
|
||||
- name: Run examples tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
OMP_NUM_THREADS: 1
|
||||
RUN_SLOW: yes
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install -r examples/requirements.txt
|
||||
python -m pytest -n 1 --dist=loadfile -s examples --durations=50
|
||||
|
||||
- name: Run all pipeline tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
OMP_NUM_THREADS: 1
|
||||
RUN_SLOW: yes
|
||||
RUN_PIPELINE_TESTS: yes
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -m pytest -n 1 --dist=loadfile -s ./tests/ -m is_pipeline_test --durations=50
|
||||
|
||||
- name: Run examples tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
OMP_NUM_THREADS: 1
|
||||
RUN_SLOW: yes
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
pip install -r examples/requirements.txt
|
||||
python -m pytest -n 1 --dist=loadfile -s examples --durations=50
|
||||
|
||||
run_all_tests_torch_and_tf_multiple_gpu:
|
||||
runs-on: [self-hosted, multi-gpu]
|
||||
|
@ -131,3 +142,13 @@ jobs:
|
|||
source .env/bin/activate
|
||||
pip install -r examples/requirements.txt
|
||||
python -m pytest -n 1 --dist=loadfile -s examples --durations=50
|
||||
|
||||
- name: Run all pipeline tests on GPU
|
||||
env:
|
||||
TF_FORCE_GPU_ALLOW_GROWTH: "true"
|
||||
OMP_NUM_THREADS: 1
|
||||
RUN_SLOW: yes
|
||||
RUN_PIPELINE_TESTS: yes
|
||||
run: |
|
||||
source .env/bin/activate
|
||||
python -m pytest -n 1 --dist=loadfile -s ./tests/ -m is_pipeline_test --durations=50
|
||||
|
|
|
@ -78,15 +78,16 @@ def convert_slow_checkpoint_to_fast(tokenizer_name, checkpoint_name, dump_path,
|
|||
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
|
||||
)
|
||||
|
||||
file_path = list(tokenizer.pretrained_vocab_files_map.values())[0][checkpoint]
|
||||
next_char = file_path.split(checkpoint)[-1][0]
|
||||
if next_char == "/":
|
||||
dump_path_full = os.path.join(dump_path_full, checkpoint_prefix_name)
|
||||
checkpoint_prefix_name = None
|
||||
if checkpoint in list(tokenizer.pretrained_vocab_files_map.values())[0]:
|
||||
file_path = list(tokenizer.pretrained_vocab_files_map.values())[0][checkpoint]
|
||||
next_char = file_path.split(checkpoint)[-1][0]
|
||||
if next_char == "/":
|
||||
dump_path_full = os.path.join(dump_path_full, checkpoint_prefix_name)
|
||||
checkpoint_prefix_name = None
|
||||
|
||||
logger.info(
|
||||
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
|
||||
)
|
||||
logger.info(
|
||||
"=> {} with prefix {}, add_prefix {}".format(dump_path_full, checkpoint_prefix_name, add_prefix)
|
||||
)
|
||||
|
||||
file_names = tokenizer.save_pretrained(
|
||||
dump_path_full, legacy_format=False, filename_prefix=checkpoint_prefix_name
|
||||
|
|
|
@ -7,11 +7,8 @@ import numpy as np
|
|||
from tqdm import tqdm
|
||||
|
||||
from ...file_utils import is_tf_available, is_torch_available
|
||||
from ...tokenization_bart import BartTokenizer
|
||||
from ...tokenization_bert import whitespace_tokenize
|
||||
from ...tokenization_longformer import LongformerTokenizer
|
||||
from ...tokenization_roberta import RobertaTokenizer
|
||||
from ...tokenization_utils_base import TruncationStrategy
|
||||
from ...tokenization_utils_base import PreTrainedTokenizerBase, TruncationStrategy
|
||||
from ...utils import logging
|
||||
from .utils import DataProcessor
|
||||
|
||||
|
@ -112,7 +109,14 @@ def squad_convert_example_to_features(
|
|||
all_doc_tokens = []
|
||||
for (i, token) in enumerate(example.doc_tokens):
|
||||
orig_to_tok_index.append(len(all_doc_tokens))
|
||||
if isinstance(tokenizer, (RobertaTokenizer, LongformerTokenizer, BartTokenizer)):
|
||||
if tokenizer.__class__.__name__ in [
|
||||
"RobertaTokenizer",
|
||||
"LongformerTokenizer",
|
||||
"BartTokenizer",
|
||||
"RobertaTokenizerFast",
|
||||
"LongformerTokenizerFast",
|
||||
"BartTokenizerFast",
|
||||
]:
|
||||
sub_tokens = tokenizer.tokenize(token, add_prefix_space=True)
|
||||
else:
|
||||
sub_tokens = tokenizer.tokenize(token)
|
||||
|
@ -292,7 +296,7 @@ def squad_convert_example_to_features(
|
|||
return features
|
||||
|
||||
|
||||
def squad_convert_example_to_features_init(tokenizer_for_convert):
|
||||
def squad_convert_example_to_features_init(tokenizer_for_convert: PreTrainedTokenizerBase):
|
||||
global tokenizer
|
||||
tokenizer = tokenizer_for_convert
|
||||
|
||||
|
@ -344,9 +348,9 @@ def squad_convert_examples_to_features(
|
|||
is_training=not evaluate,
|
||||
)
|
||||
"""
|
||||
|
||||
# Defining helper methods
|
||||
features = []
|
||||
|
||||
threads = min(threads, cpu_count())
|
||||
with Pool(threads, initializer=squad_convert_example_to_features_init, initargs=(tokenizer,)) as p:
|
||||
annotate_ = partial(
|
||||
|
@ -365,6 +369,7 @@ def squad_convert_examples_to_features(
|
|||
disable=not tqdm_enabled,
|
||||
)
|
||||
)
|
||||
|
||||
new_features = []
|
||||
unique_id = 1000000000
|
||||
example_index = 0
|
||||
|
|
|
@ -52,7 +52,7 @@ from .modeling_tf_albert import (
|
|||
TFAlbertForTokenClassification,
|
||||
TFAlbertModel,
|
||||
)
|
||||
from .modeling_tf_bart import TFBartForConditionalGeneration
|
||||
from .modeling_tf_bart import TFBartForConditionalGeneration, TFBartModel
|
||||
from .modeling_tf_bert import (
|
||||
TFBertForMaskedLM,
|
||||
TFBertForMultipleChoice,
|
||||
|
@ -163,6 +163,7 @@ TF_MODEL_MAPPING = OrderedDict(
|
|||
(T5Config, TFT5Model),
|
||||
(DistilBertConfig, TFDistilBertModel),
|
||||
(AlbertConfig, TFAlbertModel),
|
||||
(BartConfig, TFBartModel),
|
||||
(CamembertConfig, TFCamembertModel),
|
||||
(XLMRobertaConfig, TFXLMRobertaModel),
|
||||
(LongformerConfig, TFLongformerModel),
|
||||
|
@ -186,6 +187,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
|
|||
(T5Config, TFT5ForConditionalGeneration),
|
||||
(DistilBertConfig, TFDistilBertForMaskedLM),
|
||||
(AlbertConfig, TFAlbertForPreTraining),
|
||||
(BartConfig, TFBartForConditionalGeneration),
|
||||
(CamembertConfig, TFCamembertForMaskedLM),
|
||||
(XLMRobertaConfig, TFXLMRobertaForMaskedLM),
|
||||
(RobertaConfig, TFRobertaForMaskedLM),
|
||||
|
|
|
@ -640,12 +640,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
|
|||
# Load model
|
||||
if pretrained_model_name_or_path is not None:
|
||||
if os.path.isdir(pretrained_model_name_or_path):
|
||||
if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
|
||||
if from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):
|
||||
# Load from a PyTorch checkpoint in priority if from_pt
|
||||
archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)
|
||||
elif os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
|
||||
# Load from a TF 2.0 checkpoint
|
||||
archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)
|
||||
elif from_pt and os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):
|
||||
# Load from a PyTorch checkpoint
|
||||
archive_file = os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)
|
||||
else:
|
||||
raise EnvironmentError(
|
||||
"Error no file named {} found in directory {} or `from_pt` set to False".format(
|
||||
|
|
|
@ -882,10 +882,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
|
|||
if pretrained_model_name_or_path is not None:
|
||||
if os.path.isdir(pretrained_model_name_or_path):
|
||||
if from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + ".index")):
|
||||
# Load from a TF 1.0 checkpoint
|
||||
# Load from a TF 1.0 checkpoint in priority if from_tf
|
||||
archive_file = os.path.join(pretrained_model_name_or_path, TF_WEIGHTS_NAME + ".index")
|
||||
elif from_tf and os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
|
||||
# Load from a TF 2.0 checkpoint
|
||||
# Load from a TF 2.0 checkpoint in priority if from_tf
|
||||
archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)
|
||||
elif os.path.isfile(os.path.join(pretrained_model_name_or_path, WEIGHTS_NAME)):
|
||||
# Load from a PyTorch checkpoint
|
||||
|
@ -951,7 +951,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
|
|||
state_dict = torch.load(resolved_archive_file, map_location="cpu")
|
||||
except Exception:
|
||||
raise OSError(
|
||||
"Unable to load weights from pytorch checkpoint file. "
|
||||
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
|
||||
f"at '{resolved_archive_file}'"
|
||||
"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
|
||||
)
|
||||
|
||||
|
|
|
@@ -38,7 +38,7 @@ from .modelcard import ModelCard
from .tokenization_auto import AutoTokenizer
from .tokenization_bert import BasicTokenizer
from .tokenization_utils import PreTrainedTokenizer
from .tokenization_utils_base import BatchEncoding, PaddingStrategy
from .tokenization_utils_base import PaddingStrategy
from .utils import logging

@@ -2396,11 +2396,12 @@ class ConversationalPipeline(Pipeline):

def __init__(self, min_length_for_response=32, *args, **kwargs):
super().__init__(*args, **kwargs)

# We need at least an eos_token
assert self.tokenizer.eos_token_id is not None, "DialoguePipeline tokenizer should have an EOS token set"
if self.tokenizer.pad_token_id is not None:
self.pad_token_id = self.tokenizer.pad_token_id
else:
self.pad_token_id = self.tokenizer.eos_token_id
if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token = self.tokenizer.eos_token

self.min_length_for_response = min_length_for_response

def __call__(

@@ -2496,7 +2497,7 @@ class ConversationalPipeline(Pipeline):
"""
# Parse arguments
inputs = self._args_parser(*args, **kwargs)
inputs = self.tokenizer.batch_encode_plus(inputs, add_special_tokens=False, padding=False).get("input_ids", [])
inputs = self.tokenizer(inputs, add_special_tokens=False, padding=False).get("input_ids", [])
for input in inputs:
input.append(self.tokenizer.eos_token_id)
return inputs

@@ -2516,7 +2517,7 @@ class ConversationalPipeline(Pipeline):
sequence_tokens = []
is_previous_pad = False
for token in sequence:
if token == self.pad_token_id:
if token == self.tokenizer.pad_token_id:
if is_previous_pad:
continue
else:

@@ -2550,13 +2551,10 @@ class ConversationalPipeline(Pipeline):
else:
new_input = new_input[cutoff_eos_index + 1 :]
outputs.append(new_input)
max_len = max([len(item) for item in outputs])
outputs = [output + [self.pad_token_id] * (max_len - len(output)) for output in outputs]
outputs = BatchEncoding(
{"input_ids": outputs, "attention_mask": [[1] * len(outputs)]},
tensor_type=self.framework,
padded_outputs = self.tokenizer.pad(
{"input_ids": outputs}, padding="longest", return_attention_mask=True, return_tensors=self.framework
)
return outputs
return padded_outputs

# Register all the supported tasks here

@@ -2700,6 +2698,7 @@ def pipeline(
config: Optional[Union[str, PretrainedConfig]] = None,
tokenizer: Optional[Union[str, PreTrainedTokenizer]] = None,
framework: Optional[str] = None,
use_fast: bool = False,
**kwargs
) -> Pipeline:
"""

@@ -2749,6 +2748,8 @@ def pipeline(
If no framework is specified, will default to the one currently installed. If no framework is specified
and both frameworks are installed, will default to the framework of the :obj:`model`, or to PyTorch if no
model is provided.
use_fast (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use a Fast tokenizer if possible (a :class:`~transformers.PreTrainedTokenizerFast`).
kwargs:
Additional keyword arguments passed along to the specific pipeline init (see the documentation for the
corresponding pipeline class for possible values).

@@ -2807,9 +2808,10 @@ def pipeline(
if isinstance(tokenizer, (str, tuple)):
if isinstance(tokenizer, tuple):
# For tuple we have (tokenizer name, {kwargs})
tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], **tokenizer[1])
use_fast = tokenizer[1].pop("use_fast", use_fast)
tokenizer = AutoTokenizer.from_pretrained(tokenizer[0], use_fast=use_fast, **tokenizer[1])
else:
tokenizer = AutoTokenizer.from_pretrained(tokenizer)
tokenizer = AutoTokenizer.from_pretrained(tokenizer, use_fast=use_fast)

# Instantiate config if needed
if isinstance(config, str):
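To illustrate the use_fast plumbing documented above, a minimal sketch (the model name and input are illustrative examples, not taken from this diff) could look like:

    from transformers import pipeline

    # use_fast is forwarded to AutoTokenizer.from_pretrained, so the pipeline gets a
    # PreTrainedTokenizerFast whenever one exists for the chosen checkpoint.
    nlp = pipeline("fill-mask", model="distilroberta-base", use_fast=True)
    print(nlp("My name is <mask>.")[0])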
|
|
|
@@ -59,10 +59,50 @@ def parse_int_from_env(key, default=None):

_run_slow_tests = parse_flag_from_env("RUN_SLOW", default=False)
_run_pt_tf_cross_tests = parse_flag_from_env("RUN_PT_TF_CROSS_TESTS", default=False)
_run_custom_tokenizers = parse_flag_from_env("RUN_CUSTOM_TOKENIZERS", default=False)
_run_pipeline_tests = parse_flag_from_env("RUN_PIPELINE_TESTS", default=False)
_tf_gpu_memory_limit = parse_int_from_env("TF_GPU_MEMORY_LIMIT", default=None)


def is_pt_tf_cross_test(test_case):
"""
Decorator marking a test as a test that control interactions between PyTorch and TensorFlow.

PT+TF tests are skipped by default and we can run only them by setting RUN_PT_TF_CROSS_TESTS environment variable
to a truthy value and selecting the is_pt_tf_cross_test pytest mark.

"""
if not _run_pt_tf_cross_tests or not _torch_available or not _tf_available:
return unittest.skip("test is PT+TF test")(test_case)
else:
try:
import pytest  # We don't need a hard dependency on pytest in the main library
except ImportError:
return test_case
else:
return pytest.mark.is_pt_tf_cross_test()(test_case)


def is_pipeline_test(test_case):
"""
Decorator marking a test as a pipeline test.

Pipeline tests are skipped by default and we can run only them by setting RUN_PIPELINE_TEST environment variable
to a truthy value and selecting the is_pipeline_test pytest mark.

"""
if not _run_pipeline_tests:
return unittest.skip("test is pipeline test")(test_case)
else:
try:
import pytest  # We don't need a hard dependency on pytest in the main library
except ImportError:
return test_case
else:
return pytest.mark.is_pipeline_test()(test_case)


def slow(test_case):
"""
Decorator marking a test as slow.
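As a usage sketch (not part of this diff), a pipeline test module would combine the new marker decorator with the existing guards from testing_utils; the suite is then selected exactly as in the CI config above, via RUN_PIPELINE_TESTS=1 and -m is_pipeline_test:

    import unittest

    from transformers.testing_utils import is_pipeline_test, require_torch, slow


    @is_pipeline_test
    class ExamplePipelineTests(unittest.TestCase):  # hypothetical test class
        @require_torch
        def test_small_model(self):
            ...  # fast check against a tiny checkpoint

        @slow
        @require_torch
        def test_large_model(self):
            ...  # only runs when RUN_SLOW=yes is also set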
@ -136,18 +136,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
|
|||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def get_vocab(self) -> Dict[str, int]:
|
||||
"""
|
||||
Returns the vocabulary as a dictionary of token to index.
|
||||
|
||||
:obj:`tokenizer.get_vocab()[token]` is equivalent to :obj:`tokenizer.convert_tokens_to_ids(token)` when
|
||||
:obj:`token` is in the vocab.
|
||||
|
||||
Returns:
|
||||
:obj:`Dict[str, int]`: The vocabulary.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_added_vocab(self) -> Dict[str, int]:
|
||||
"""
|
||||
Returns the added tokens in the vocabulary as a dictionary of token to index.
|
||||
|
@ -733,47 +721,15 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
|
|||
raise NotImplementedError
|
||||
|
||||
def convert_tokens_to_string(self, tokens: List[str]) -> str:
|
||||
"""
|
||||
Converts a sequence of token ids in a single string.
|
||||
|
||||
The most simple way to do it is ``" ".join(tokens)`` but we often want to remove
|
||||
sub-word tokenization artifacts at the same time.
|
||||
|
||||
Args:
|
||||
tokens (:obj:`List[str]`): The token to join in a string.
|
||||
|
||||
Return: The joined tokens.
|
||||
"""
|
||||
return " ".join(tokens)
|
||||
|
||||
def decode(
|
||||
def _decode(
|
||||
self,
|
||||
token_ids: List[int],
|
||||
skip_special_tokens: bool = False,
|
||||
clean_up_tokenization_spaces: bool = True,
|
||||
spaces_between_special_tokens: bool = True,
|
||||
) -> str:
|
||||
"""
|
||||
Converts a sequence of ids in a string, using the tokenizer and vocabulary
|
||||
with options to remove special tokens and clean up tokenization spaces.
|
||||
|
||||
Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
|
||||
|
||||
Args:
|
||||
token_ids (:obj:`List[int]`):
|
||||
List of tokenized input ids. Can be obtained using the ``__call__`` method.
|
||||
skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to remove special tokens in the decoding.
|
||||
clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to clean up the tokenization spaces.
|
||||
spaces_between_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to add spaces around special tokens.
|
||||
The behavior of Fast tokenizers is to have this to :obj:`False`.
|
||||
This is setup to :obj:`True` in slow tokenizers for backward compatibility.
|
||||
|
||||
Returns:
|
||||
:obj:`str`: The decoded sentence.
|
||||
"""
|
||||
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
|
||||
|
||||
# To avoid mixing byte-level and unicode for byte-level BPT
|
||||
|
|
|
@@ -175,6 +175,23 @@ class TokenSpan(NamedTuple):
end: int


def to_py_obj(obj):
"""
Convert a TensorFlow tensor, PyTorch tensor, Numpy array or python list
to a python list.
"""
if isinstance(obj, (list, tuple)):
return [to_py_obj(o) for o in obj]
elif is_tf_available() and isinstance(obj, tf.Tensor):
return obj.numpy().tolist()
elif is_torch_available() and isinstance(obj, torch.Tensor):
return obj.detach().cpu().tolist()
elif isinstance(obj, np.ndarray):
return obj.tolist()
else:
return obj


class BatchEncoding(UserDict):
"""
Holds the output of the :meth:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase.encode_plus`
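A quick sketch of what the new helper normalizes (assuming it is imported from transformers.tokenization_utils_base, where it is defined above, and that torch is installed):

    import numpy as np
    import torch

    from transformers.tokenization_utils_base import to_py_obj

    # Nested containers are converted recursively; tensors and arrays become plain lists.
    assert to_py_obj([np.array([1, 2]), (3, 4)]) == [[1, 2], [3, 4]]
    assert to_py_obj(torch.tensor([[0, 1]])) == [[0, 1]]
    assert to_py_obj("already a python object") == "already a python object"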
@ -1025,6 +1042,38 @@ class SpecialTokensMixin:
|
|||
"""
|
||||
return self.convert_tokens_to_ids(self.additional_special_tokens)
|
||||
|
||||
@bos_token_id.setter
|
||||
def bos_token_id(self, value):
|
||||
self._bos_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@eos_token_id.setter
|
||||
def eos_token_id(self, value):
|
||||
self._eos_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@unk_token_id.setter
|
||||
def unk_token_id(self, value):
|
||||
self._unk_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@sep_token_id.setter
|
||||
def sep_token_id(self, value):
|
||||
self._sep_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@pad_token_id.setter
|
||||
def pad_token_id(self, value):
|
||||
self._pad_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@cls_token_id.setter
|
||||
def cls_token_id(self, value):
|
||||
self._cls_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@mask_token_id.setter
|
||||
def mask_token_id(self, value):
|
||||
self._mask_token = self.convert_tokens_to_ids(value)
|
||||
|
||||
@additional_special_tokens_ids.setter
|
||||
def additional_special_tokens_ids(self, values):
|
||||
self._additional_special_tokens = [self.convert_tokens_to_ids(value) for value in values]
|
||||
|
||||
@property
|
||||
def special_tokens_map(self) -> Dict[str, Union[str, List[str]]]:
|
||||
"""
|
||||
|
@ -1424,6 +1473,18 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
|
|||
f"padding_side='{self.padding_side}', special_tokens={self.special_tokens_map_extended})"
|
||||
)
|
||||
|
||||
def get_vocab(self) -> Dict[str, int]:
|
||||
"""
|
||||
Returns the vocabulary as a dictionary of token to index.
|
||||
|
||||
:obj:`tokenizer.get_vocab()[token]` is equivalent to :obj:`tokenizer.convert_tokens_to_ids(token)` when
|
||||
:obj:`token` is in the vocab.
|
||||
|
||||
Returns:
|
||||
:obj:`Dict[str, int]`: The vocabulary.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs):
|
||||
r"""
|
||||
|
@ -1852,6 +1913,32 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
|
|||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> List[str]:
|
||||
"""
|
||||
Converts a string in a sequence of tokens, using the backend Rust tokenizer.
|
||||
|
||||
Note that this method behave differently between fast and slow tokenizers:
|
||||
- in fast tokenizers (instances of :class:`~transformers.PreTrainedTokenizerFast`), this method
|
||||
will replace the unknown tokens with the :obj:`unk_token`,
|
||||
- in slow tokenizers (instances of :class:`~transformers.PreTrainedTokenizer`), this method
|
||||
keep unknown tokens unchanged.
|
||||
|
||||
Args:
|
||||
text (:obj:`str`):
|
||||
The sequence to be encoded.
|
||||
pair (:obj:`str`, `optional`):
|
||||
A second sequence to be encoded with the first.
|
||||
add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to add the special tokens associated with the corresponding model.
|
||||
kwargs (additional keyword arguments, `optional`):
|
||||
Will be passed to the underlying model specific encode method.
|
||||
See details in :meth:`~transformers.PreTrainedTokenizer.__call__`
|
||||
|
||||
Returns:
|
||||
:obj:`List[str]`: The list of tokens.
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@add_end_docstrings(
|
||||
ENCODE_KWARGS_DOCSTRING,
|
||||
"""
|
||||
|
@ -2456,18 +2543,6 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
|
|||
f"Should be one of a python, numpy, pytorch or tensorflow object."
|
||||
)
|
||||
|
||||
def to_py_obj(obj):
|
||||
if isinstance(obj, (list, tuple)):
|
||||
return [to_py_obj(o) for o in obj]
|
||||
elif is_tf_available() and isinstance(obj, tf.Tensor):
|
||||
return obj.numpy().tolist()
|
||||
elif is_torch_available() and isinstance(obj, torch.Tensor):
|
||||
return obj.cpu().tolist()
|
||||
elif isinstance(obj, np.ndarray):
|
||||
return obj.tolist()
|
||||
else:
|
||||
return obj
|
||||
|
||||
for key, value in encoded_inputs.items():
|
||||
encoded_inputs[key] = to_py_obj(value)
|
||||
|
||||
|
@ -2862,33 +2937,53 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
|
|||
|
||||
return encoded_inputs
|
||||
|
||||
def convert_tokens_to_string(self, tokens: List[str]) -> str:
|
||||
"""
|
||||
Converts a sequence of token ids in a single string.
|
||||
The most simple way to do it is ``" ".join(tokens)`` but we often want to remove
|
||||
sub-word tokenization artifacts at the same time.
|
||||
Args:
|
||||
tokens (:obj:`List[str]`): The token to join in a string.
|
||||
Return: The joined tokens.
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def batch_decode(
|
||||
self, sequences: List[List[int]], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True
|
||||
self,
|
||||
sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
|
||||
skip_special_tokens: bool = False,
|
||||
clean_up_tokenization_spaces: bool = True,
|
||||
**kwargs
|
||||
) -> List[str]:
|
||||
"""
|
||||
Convert a list of lists of token ids into a list of strings by calling decode.
|
||||
|
||||
Args:
|
||||
sequences (:obj:`List[List[int]]`):
|
||||
sequences (:obj:`Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]`):
|
||||
List of tokenized input ids. Can be obtained using the ``__call__`` method.
|
||||
skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to remove special tokens in the decoding.
|
||||
clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to clean up the tokenization spaces.
|
||||
kwargs (additional keyword arguments, `optional`):
|
||||
Will be passed to the underlying model specific decode method.
|
||||
|
||||
Returns:
|
||||
:obj:`List[str]`: The list of decoded sentences.
|
||||
"""
|
||||
return [
|
||||
self.decode(
|
||||
seq, skip_special_tokens=skip_special_tokens, clean_up_tokenization_spaces=clean_up_tokenization_spaces
|
||||
seq,
|
||||
skip_special_tokens=skip_special_tokens,
|
||||
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
|
||||
**kwargs,
|
||||
)
|
||||
for seq in sequences
|
||||
]
|
||||
|
||||
def decode(
|
||||
self,
|
||||
token_ids: List[int],
|
||||
token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
|
||||
skip_special_tokens: bool = False,
|
||||
clean_up_tokenization_spaces: bool = True,
|
||||
**kwargs
|
||||
|
@ -2900,16 +2995,35 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
|
|||
Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
|
||||
|
||||
Args:
|
||||
token_ids (:obj:`List[int]`):
|
||||
token_ids (:obj:`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
|
||||
List of tokenized input ids. Can be obtained using the ``__call__`` method.
|
||||
skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to remove special tokens in the decoding.
|
||||
clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to clean up the tokenization spaces.
|
||||
kwargs (additional keyword arguments, `optional`):
|
||||
Will be passed to the underlying model specific decode method.
|
||||
|
||||
Returns:
|
||||
:obj:`str`: The decoded sentence.
|
||||
"""
|
||||
# Convert inputs to python lists
|
||||
token_ids = to_py_obj(token_ids)
|
||||
|
||||
return self._decode(
|
||||
token_ids=token_ids,
|
||||
skip_special_tokens=skip_special_tokens,
|
||||
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def _decode(
|
||||
self,
|
||||
token_ids: Union[int, List[int]],
|
||||
skip_special_tokens: bool = False,
|
||||
clean_up_tokenization_spaces: bool = True,
|
||||
**kwargs
|
||||
) -> str:
|
||||
raise NotImplementedError
|
||||
|
||||
def get_special_tokens_mask(
|
||||
|
|
|
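Taken together, the widened decode/batch_decode signatures above mean framework tensors can be passed straight back to the tokenizer; a small sketch (the checkpoint name is an example, and return_tensors="pt" assumes torch is installed):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True)
    batch = tokenizer(["hello world", "hi"], padding=True, return_tensors="pt")

    # token_ids may now be an int, a list of ints, a numpy array or a torch/tf tensor.
    print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True))
    print(tokenizer.decode(batch["input_ids"][0], skip_special_tokens=True))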
@ -122,17 +122,12 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
|
|||
return self._tokenizer.get_vocab_size(with_added_tokens=False)
|
||||
|
||||
def get_vocab(self) -> Dict[str, int]:
|
||||
"""
|
||||
Returns the vocabulary as a dictionary of token to index.
|
||||
|
||||
:obj:`tokenizer.get_vocab()[token]` is equivalent to :obj:`tokenizer.convert_tokens_to_ids(token)` when
|
||||
:obj:`token` is in the vocab.
|
||||
|
||||
Returns:
|
||||
:obj:`Dict[str, int]`: The vocabulary.
|
||||
"""
|
||||
return self._tokenizer.get_vocab(with_added_tokens=True)
|
||||
|
||||
@property
|
||||
def vocab(self) -> Dict[str, int]:
|
||||
return self.get_vocab()
|
||||
|
||||
def get_added_vocab(self) -> Dict[str, int]:
|
||||
"""
|
||||
Returns the added tokens in the vocabulary as a dictionary of token to index.
|
||||
|
@ -291,25 +286,8 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
|
|||
tokens.append(self._tokenizer.id_to_token(index))
|
||||
return tokens
|
||||
|
||||
def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False) -> List[str]:
|
||||
"""
|
||||
Converts a string in a sequence of tokens, using the backend Rust tokenizer.
|
||||
|
||||
Note that, unlike slow tokenizers (instances of :class:`~transformers.PreTrainedTokenizer`), this method
|
||||
will replace the unknown tokens with the :obj:`unk_token`.
|
||||
|
||||
Args:
|
||||
text (:obj:`str`):
|
||||
The sequence to be encoded.
|
||||
pair (:obj:`str`, `optional`):
|
||||
A second sequence to be encoded with the first.
|
||||
add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to add the special tokens associated with the corresponding model.
|
||||
|
||||
Returns:
|
||||
:obj:`List[str]`: The list of tokens.
|
||||
"""
|
||||
return self._tokenizer.encode(text, pair, add_special_tokens=add_special_tokens).tokens
|
||||
def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> List[str]:
|
||||
return self.encode_plus(text=text, text_pair=pair, add_special_tokens=add_special_tokens, **kwargs).tokens()
|
||||
|
||||
def set_truncation_and_padding(
|
||||
self,
|
||||
|
@ -405,29 +383,11 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
|
|||
pad_to_multiple_of=pad_to_multiple_of,
|
||||
)
|
||||
|
||||
# Avoid thread overhead if only one example.
|
||||
if len(batch_text_or_text_pairs) == 1:
|
||||
if isinstance(batch_text_or_text_pairs[0], tuple):
|
||||
# We got a Tuple with a pair of sequences
|
||||
encodings = self._tokenizer.encode(
|
||||
*batch_text_or_text_pairs[0],
|
||||
add_special_tokens=add_special_tokens,
|
||||
is_pretokenized=is_split_into_words,
|
||||
)
|
||||
else:
|
||||
# We got a single sequence
|
||||
encodings = self._tokenizer.encode(
|
||||
batch_text_or_text_pairs[0],
|
||||
add_special_tokens=add_special_tokens,
|
||||
is_pretokenized=is_split_into_words,
|
||||
)
|
||||
encodings = [encodings]
|
||||
else:
|
||||
encodings = self._tokenizer.encode_batch(
|
||||
batch_text_or_text_pairs,
|
||||
add_special_tokens=add_special_tokens,
|
||||
is_pretokenized=is_split_into_words,
|
||||
)
|
||||
encodings = self._tokenizer.encode_batch(
|
||||
batch_text_or_text_pairs,
|
||||
add_special_tokens=add_special_tokens,
|
||||
is_pretokenized=is_split_into_words,
|
||||
)
|
||||
|
||||
# Convert encoding to dict
|
||||
# `Tokens` has type: List[Dict[str, List[List[int]]]] or List[Dict[str, 2D-Tensor]]
|
||||
|
@ -525,30 +485,16 @@ class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
|
|||
|
||||
return batched_output
|
||||
|
||||
def decode(
|
||||
def convert_tokens_to_string(self, tokens: List[str]) -> str:
|
||||
return self.backend_tokenizer.decoder.decode(tokens)
|
||||
|
||||
def _decode(
|
||||
self,
|
||||
token_ids: Union[int, List[int]],
|
||||
skip_special_tokens: bool = False,
|
||||
clean_up_tokenization_spaces: bool = True,
|
||||
**kwargs
|
||||
) -> str:
|
||||
"""
|
||||
Converts a sequence of ids in a string, using the tokenizer and vocabulary
|
||||
with options to remove special tokens and clean up tokenization spaces.
|
||||
|
||||
Similar to doing ``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``.
|
||||
|
||||
Args:
|
||||
token_ids (:obj:`Union[int, List[int]]`):
|
||||
List of tokenized input ids. Can be obtained using the ``__call__`` method.
|
||||
skip_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether or not to remove special tokens in the decoding.
|
||||
clean_up_tokenization_spaces (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to clean up the tokenization spaces.
|
||||
|
||||
Returns:
|
||||
:obj:`str`: The decoded sentence.
|
||||
"""
|
||||
if isinstance(token_ids, int):
|
||||
token_ids = [token_ids]
|
||||
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
|
||||
|
|
|
@@ -15,3 +15,10 @@ sys.path.insert(1, git_repo_path)
# silence FutureWarning warnings in tests since often we can't act on them until
# they become normal warnings - i.e. the tests still need to test the current functionality
warnings.simplefilter(action="ignore", category=FutureWarning)


def pytest_configure(config):
config.addinivalue_line("markers", "is_pipeline_test: mark test to run only when pipeline are tested")
config.addinivalue_line(
"markers", "is_pt_tf_cross_test: mark test to run only when PT and TF interactions are tested"
)
|
|
|
@ -19,7 +19,7 @@ import unittest
|
|||
|
||||
from transformers import is_tf_available
|
||||
from transformers.file_utils import cached_property
|
||||
from transformers.testing_utils import require_tf, require_torch, slow
|
||||
from transformers.testing_utils import is_pt_tf_cross_test, require_tf, slow
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
|
||||
|
@ -231,8 +231,7 @@ def _long_tensor(tok_lst):
|
|||
TOLERANCE = 1e-4
|
||||
|
||||
|
||||
@require_tf
|
||||
@require_torch
|
||||
@is_pt_tf_cross_test
|
||||
@slow
|
||||
class TFBartModelIntegrationTest(unittest.TestCase):
|
||||
def test_inference_no_head(self):
|
||||
|
|
|
@ -23,8 +23,8 @@ import unittest
|
|||
from importlib import import_module
|
||||
from typing import List, Tuple
|
||||
|
||||
from transformers import is_tf_available, is_torch_available
|
||||
from transformers.testing_utils import _tf_gpu_memory_limit, require_tf, slow
|
||||
from transformers import is_tf_available
|
||||
from transformers.testing_utils import _tf_gpu_memory_limit, is_pt_tf_cross_test, require_tf, slow
|
||||
|
||||
|
||||
if is_tf_available():
|
||||
|
@ -291,9 +291,8 @@ class TFModelTesterMixin:
|
|||
max_diff = np.amax(np.abs(out_1 - out_2))
|
||||
self.assertLessEqual(max_diff, 1e-5)
|
||||
|
||||
@is_pt_tf_cross_test
|
||||
def test_pt_tf_model_equivalence(self):
|
||||
if not is_torch_available():
|
||||
return
|
||||
|
||||
import torch
|
||||
|
||||
|
|
|
@ -0,0 +1,243 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2018 The Google AI Language Team Authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import is_tf_available, is_torch_available
|
||||
from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, is_pt_tf_cross_test, slow
|
||||
|
||||
|
||||
if is_tf_available():
|
||||
from transformers import (
|
||||
AutoConfig,
|
||||
BertConfig,
|
||||
GPT2Config,
|
||||
T5Config,
|
||||
TFAutoModel,
|
||||
TFAutoModelForCausalLM,
|
||||
TFAutoModelForMaskedLM,
|
||||
TFAutoModelForPreTraining,
|
||||
TFAutoModelForQuestionAnswering,
|
||||
TFAutoModelForSeq2SeqLM,
|
||||
TFAutoModelForSequenceClassification,
|
||||
TFAutoModelWithLMHead,
|
||||
TFBertForMaskedLM,
|
||||
TFBertForPreTraining,
|
||||
TFBertForQuestionAnswering,
|
||||
TFBertForSequenceClassification,
|
||||
TFBertModel,
|
||||
TFGPT2LMHeadModel,
|
||||
TFRobertaForMaskedLM,
|
||||
TFT5ForConditionalGeneration,
|
||||
)
|
||||
from transformers.modeling_tf_bert import TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
from transformers.modeling_tf_gpt2 import TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
from transformers.modeling_tf_t5 import TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
|
||||
if is_torch_available():
|
||||
from transformers import (
|
||||
AutoModel,
|
||||
AutoModelForCausalLM,
|
||||
AutoModelForMaskedLM,
|
||||
AutoModelForPreTraining,
|
||||
AutoModelForQuestionAnswering,
|
||||
AutoModelForSeq2SeqLM,
|
||||
AutoModelForSequenceClassification,
|
||||
AutoModelWithLMHead,
|
||||
BertForMaskedLM,
|
||||
BertForPreTraining,
|
||||
BertForQuestionAnswering,
|
||||
BertForSequenceClassification,
|
||||
BertModel,
|
||||
GPT2LMHeadModel,
|
||||
RobertaForMaskedLM,
|
||||
T5ForConditionalGeneration,
|
||||
)
|
||||
|
||||
|
||||
@is_pt_tf_cross_test
|
||||
class TFPTAutoModelTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
import h5py
|
||||
|
||||
self.assertTrue(h5py.version.hdf5_version.startswith("1.10"))
|
||||
|
||||
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
for model_name in ["bert-base-uncased"]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, BertConfig)
|
||||
|
||||
model = TFAutoModel.from_pretrained(model_name, from_pt=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFBertModel)
|
||||
|
||||
model = AutoModel.from_pretrained(model_name, from_tf=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertModel)
|
||||
|
||||
@slow
|
||||
def test_model_for_pretraining_from_pretrained(self):
|
||||
import h5py
|
||||
|
||||
self.assertTrue(h5py.version.hdf5_version.startswith("1.10"))
|
||||
|
||||
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
for model_name in ["bert-base-uncased"]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, BertConfig)
|
||||
|
||||
model = TFAutoModelForPreTraining.from_pretrained(model_name, from_pt=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFBertForPreTraining)
|
||||
|
||||
model = AutoModelForPreTraining.from_pretrained(model_name, from_tf=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertForPreTraining)
|
||||
|
||||
@slow
|
||||
def test_model_for_causal_lm(self):
|
||||
for model_name in TF_GPT2_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, GPT2Config)
|
||||
|
||||
model = TFAutoModelForCausalLM.from_pretrained(model_name, from_pt=True)
|
||||
model, loading_info = TFAutoModelForCausalLM.from_pretrained(
|
||||
model_name, output_loading_info=True, from_pt=True
|
||||
)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFGPT2LMHeadModel)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(model_name, from_tf=True)
|
||||
model, loading_info = AutoModelForCausalLM.from_pretrained(
|
||||
model_name, output_loading_info=True, from_tf=True
|
||||
)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, GPT2LMHeadModel)
|
||||
|
||||
@slow
|
||||
def test_lmhead_model_from_pretrained(self):
|
||||
for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, BertConfig)
|
||||
|
||||
model = TFAutoModelWithLMHead.from_pretrained(model_name, from_pt=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFBertForMaskedLM)
|
||||
|
||||
model = AutoModelWithLMHead.from_pretrained(model_name, from_tf=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertForMaskedLM)
|
||||
|
||||
@slow
|
||||
def test_model_for_masked_lm(self):
|
||||
for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, BertConfig)
|
||||
|
||||
model = TFAutoModelForMaskedLM.from_pretrained(model_name, from_pt=True)
|
||||
model, loading_info = TFAutoModelForMaskedLM.from_pretrained(
|
||||
model_name, output_loading_info=True, from_pt=True
|
||||
)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFBertForMaskedLM)
|
||||
|
||||
model = AutoModelForMaskedLM.from_pretrained(model_name, from_tf=True)
|
||||
model, loading_info = AutoModelForMaskedLM.from_pretrained(
|
||||
model_name, output_loading_info=True, from_tf=True
|
||||
)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertForMaskedLM)
|
||||
|
||||
@slow
|
||||
def test_model_for_encoder_decoder_lm(self):
|
||||
for model_name in TF_T5_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, T5Config)
|
||||
|
||||
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name, from_pt=True)
|
||||
model, loading_info = TFAutoModelForSeq2SeqLM.from_pretrained(
|
||||
model_name, output_loading_info=True, from_pt=True
|
||||
)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFT5ForConditionalGeneration)
|
||||
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, from_tf=True)
|
||||
model, loading_info = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
model_name, output_loading_info=True, from_tf=True
|
||||
)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, T5ForConditionalGeneration)
|
||||
|
||||
@slow
|
||||
def test_sequence_classification_model_from_pretrained(self):
|
||||
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
for model_name in ["bert-base-uncased"]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, BertConfig)
|
||||
|
||||
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFBertForSequenceClassification)
|
||||
|
||||
model = AutoModelForSequenceClassification.from_pretrained(model_name, from_tf=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertForSequenceClassification)
|
||||
|
||||
@slow
|
||||
def test_question_answering_model_from_pretrained(self):
|
||||
# for model_name in TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
for model_name in ["bert-base-uncased"]:
|
||||
config = AutoConfig.from_pretrained(model_name)
|
||||
self.assertIsNotNone(config)
|
||||
self.assertIsInstance(config, BertConfig)
|
||||
|
||||
model = TFAutoModelForQuestionAnswering.from_pretrained(model_name, from_pt=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, TFBertForQuestionAnswering)
|
||||
|
||||
model = AutoModelForQuestionAnswering.from_pretrained(model_name, from_tf=True)
|
||||
self.assertIsNotNone(model)
|
||||
self.assertIsInstance(model, BertForQuestionAnswering)
|
||||
|
||||
def test_from_pretrained_identifier(self):
|
||||
model = TFAutoModelWithLMHead.from_pretrained(SMALL_MODEL_IDENTIFIER, from_pt=True)
|
||||
self.assertIsInstance(model, TFBertForMaskedLM)
|
||||
self.assertEqual(model.num_parameters(), 14830)
|
||||
self.assertEqual(model.num_parameters(only_trainable=True), 14830)
|
||||
|
||||
model = AutoModelWithLMHead.from_pretrained(SMALL_MODEL_IDENTIFIER, from_tf=True)
|
||||
self.assertIsInstance(model, BertForMaskedLM)
|
||||
self.assertEqual(model.num_parameters(), 14410)
|
||||
self.assertEqual(model.num_parameters(only_trainable=True), 14410)
|
||||
|
||||
def test_from_identifier_from_model_type(self):
|
||||
model = TFAutoModelWithLMHead.from_pretrained(DUMMY_UNKWOWN_IDENTIFIER, from_pt=True)
|
||||
self.assertIsInstance(model, TFRobertaForMaskedLM)
|
||||
self.assertEqual(model.num_parameters(), 14830)
|
||||
self.assertEqual(model.num_parameters(only_trainable=True), 14830)
|
||||
|
||||
model = AutoModelWithLMHead.from_pretrained(DUMMY_UNKWOWN_IDENTIFIER, from_tf=True)
|
||||
self.assertIsInstance(model, RobertaForMaskedLM)
|
||||
self.assertEqual(model.num_parameters(), 14410)
|
||||
self.assertEqual(model.num_parameters(only_trainable=True), 14410)
|
|
@ -1,869 +0,0 @@
|
|||
import unittest
|
||||
from typing import Iterable, List, Optional
|
||||
|
||||
import pytest
|
||||
|
||||
from transformers import pipeline
|
||||
from transformers.pipelines import SUPPORTED_TASKS, Conversation, DefaultArgumentHandler, Pipeline
|
||||
from transformers.testing_utils import require_tf, require_tokenizers, require_torch, slow, torch_device
|
||||
|
||||
|
||||
DEFAULT_DEVICE_NUM = -1 if torch_device == "cpu" else 0
|
||||
VALID_INPUTS = ["A simple string", ["list of strings"]]
|
||||
|
||||
NER_FINETUNED_MODELS = ["sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english"]
|
||||
TF_NER_FINETUNED_MODELS = ["Narsil/small"]
|
||||
|
||||
# xlnet-base-cased disabled for now, since it crashes TF2
|
||||
FEATURE_EXTRACT_FINETUNED_MODELS = ["sshleifer/tiny-distilbert-base-cased"]
|
||||
TEXT_CLASSIF_FINETUNED_MODELS = ["sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english"]
|
||||
TEXT_GENERATION_FINETUNED_MODELS = ["sshleifer/tiny-ctrl"]
|
||||
|
||||
FILL_MASK_FINETUNED_MODELS = ["sshleifer/tiny-distilroberta-base"]
|
||||
LARGE_FILL_MASK_FINETUNED_MODELS = ["distilroberta-base"] # @slow
|
||||
|
||||
SUMMARIZATION_FINETUNED_MODELS = ["sshleifer/bart-tiny-random", "patrickvonplaten/t5-tiny-random"]
|
||||
TF_SUMMARIZATION_FINETUNED_MODELS = ["sshleifer/bart-tiny-random", "patrickvonplaten/t5-tiny-random"]
|
||||
|
||||
TRANSLATION_FINETUNED_MODELS = [
|
||||
("patrickvonplaten/t5-tiny-random", "translation_en_to_de"),
|
||||
("patrickvonplaten/t5-tiny-random", "translation_en_to_ro"),
|
||||
]
|
||||
TF_TRANSLATION_FINETUNED_MODELS = [("patrickvonplaten/t5-tiny-random", "translation_en_to_fr")]
|
||||
|
||||
TEXT2TEXT_FINETUNED_MODELS = ["patrickvonplaten/t5-tiny-random"]
|
||||
TF_TEXT2TEXT_FINETUNED_MODELS = ["patrickvonplaten/t5-tiny-random"]
|
||||
|
||||
DIALOGUE_FINETUNED_MODELS = ["microsoft/DialoGPT-medium"] # @slow
|
||||
|
||||
expected_fill_mask_result = [
|
||||
[
|
||||
{"sequence": "<s>My name is John</s>", "score": 0.00782308354973793, "token": 610, "token_str": "ĠJohn"},
|
||||
{"sequence": "<s>My name is Chris</s>", "score": 0.007475061342120171, "token": 1573, "token_str": "ĠChris"},
|
||||
],
|
||||
[
|
||||
{"sequence": "<s>The largest city in France is Paris</s>", "score": 0.3185044229030609, "token": 2201},
|
||||
{"sequence": "<s>The largest city in France is Lyon</s>", "score": 0.21112334728240967, "token": 12790},
|
||||
],
|
||||
]
|
||||
|
||||
expected_fill_mask_target_result = [
|
||||
[
|
||||
{
|
||||
"sequence": "<s>My name is Patrick</s>",
|
||||
"score": 0.004992353264242411,
|
||||
"token": 3499,
|
||||
"token_str": "ĠPatrick",
|
||||
},
|
||||
{
|
||||
"sequence": "<s>My name is Clara</s>",
|
||||
"score": 0.00019297805556561798,
|
||||
"token": 13606,
|
||||
"token_str": "ĠClara",
|
||||
},
|
||||
]
|
||||
]
|
||||
|
||||
SUMMARIZATION_KWARGS = dict(num_beams=2, min_length=2, max_length=5)
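A minimal sketch of how the constants above are meant to be consumed by the tests below; the checkpoint names, device index and generation kwargs come from the definitions above, everything else is illustrative:

# Sketch only: mirrors how the summarization tests below wire the module-level constants into `pipeline`.
summarizer = pipeline(
    task="summarization",
    model=SUMMARIZATION_FINETUNED_MODELS[0],   # tiny random checkpoint listed above
    tokenizer=SUMMARIZATION_FINETUNED_MODELS[0],
    device=DEFAULT_DEVICE_NUM,                 # -1 selects CPU, 0 the first CUDA device
)
summaries = summarizer(VALID_INPUTS[0], **SUMMARIZATION_KWARGS)  # num_beams=2, min_length=2, max_length=5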
|
||||
|
||||
|
||||
class DefaultArgumentHandlerTestCase(unittest.TestCase):
|
||||
def setUp(self) -> None:
|
||||
self.handler = DefaultArgumentHandler()
|
||||
|
||||
def test_kwargs_x(self):
|
||||
mono_data = {"X": "This is a sample input"}
|
||||
mono_args = self.handler(**mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
multi_data = {"x": ["This is a sample input", "This is a second sample input"]}
|
||||
multi_args = self.handler(**multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
||||
|
||||
def test_kwargs_data(self):
|
||||
mono_data = {"data": "This is a sample input"}
|
||||
mono_args = self.handler(**mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
multi_data = {"data": ["This is a sample input", "This is a second sample input"]}
|
||||
multi_args = self.handler(**multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
||||
|
||||
def test_multi_kwargs(self):
|
||||
mono_data = {"data": "This is a sample input", "X": "This is a sample input 2"}
|
||||
mono_args = self.handler(**mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 2)
|
||||
|
||||
multi_data = {
|
||||
"data": ["This is a sample input", "This is a second sample input"],
|
||||
"test": ["This is a sample input 2", "This is a second sample input 2"],
|
||||
}
|
||||
multi_args = self.handler(**multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 4)
|
||||
|
||||
def test_args(self):
|
||||
mono_data = "This is a sample input"
|
||||
mono_args = self.handler(mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
mono_data = ["This is a sample input"]
|
||||
mono_args = self.handler(mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
multi_data = ["This is a sample input", "This is a second sample input"]
|
||||
multi_args = self.handler(multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
||||
|
||||
multi_data = ["This is a sample input", "This is a second sample input"]
|
||||
multi_args = self.handler(*multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
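In sketch form, the contract these handler tests pin down (only types and lengths are asserted; the concrete strings are just sample inputs):

handler = DefaultArgumentHandler()
assert isinstance(handler(X="A simple string"), list) and len(handler(X="A simple string")) == 1
assert len(handler(data=["a", "b"])) == 2   # a list value is flattened into its elements
assert len(handler(data="a", X="b")) == 2   # several keyword inputs are concatenated
assert len(handler("a", "b")) == 2          # positional inputs behave the same way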
|
||||
|
||||
|
||||
class MonoColumnInputTestCase(unittest.TestCase):
|
||||
def _test_mono_column_pipeline(
|
||||
self,
|
||||
nlp: Pipeline,
|
||||
valid_inputs: List,
|
||||
output_keys: Iterable[str],
|
||||
invalid_inputs: List = [None],
|
||||
expected_multi_result: Optional[List] = None,
|
||||
expected_check_keys: Optional[List[str]] = None,
|
||||
**kwargs,
|
||||
):
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(valid_inputs[0], **kwargs)
|
||||
self.assertIsInstance(mono_result, list)
|
||||
self.assertIsInstance(mono_result[0], (dict, list))
|
||||
|
||||
if isinstance(mono_result[0], list):
|
||||
mono_result = mono_result[0]
|
||||
|
||||
for key in output_keys:
|
||||
self.assertIn(key, mono_result[0])
|
||||
|
||||
multi_result = [nlp(input, **kwargs) for input in valid_inputs]
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], (dict, list))
|
||||
|
||||
if expected_multi_result is not None:
|
||||
for result, expect in zip(multi_result, expected_multi_result):
|
||||
for key in expected_check_keys or []:
|
||||
self.assertEqual(
|
||||
set([o[key] for o in result]),
|
||||
set([o[key] for o in expect]),
|
||||
)
|
||||
|
||||
if isinstance(multi_result[0], list):
|
||||
multi_result = multi_result[0]
|
||||
|
||||
for result in multi_result:
|
||||
for key in output_keys:
|
||||
self.assertIn(key, result)
|
||||
|
||||
self.assertRaises(Exception, nlp, invalid_inputs)
|
||||
|
||||
@require_torch
|
||||
def test_torch_sentiment_analysis(self):
|
||||
mandatory_keys = {"label", "score"}
|
||||
for model_name in TEXT_CLASSIF_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="sentiment-analysis", model=model_name, tokenizer=model_name)
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, mandatory_keys)
|
||||
|
||||
@require_tf
|
||||
def test_tf_sentiment_analysis(self):
|
||||
mandatory_keys = {"label", "score"}
|
||||
for model_name in TEXT_CLASSIF_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="sentiment-analysis", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, mandatory_keys)
|
||||
|
||||
@require_torch
|
||||
def test_torch_feature_extraction(self):
|
||||
for model_name in FEATURE_EXTRACT_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="feature-extraction", model=model_name, tokenizer=model_name)
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, {})
|
||||
|
||||
@require_tf
|
||||
def test_tf_feature_extraction(self):
|
||||
for model_name in FEATURE_EXTRACT_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="feature-extraction", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, {})
|
||||
|
||||
@require_torch
|
||||
def test_torch_fill_mask(self):
|
||||
mandatory_keys = {"sequence", "score", "token"}
|
||||
valid_inputs = [
|
||||
"My name is <mask>",
|
||||
"The largest city in France is <mask>",
|
||||
]
|
||||
invalid_inputs = [
|
||||
"This is <mask> <mask>" # More than 1 mask_token in the input is not supported
|
||||
"This is" # No mask_token is not supported
|
||||
]
|
||||
for model_name in FILL_MASK_FINETUNED_MODELS:
|
||||
nlp = pipeline(
|
||||
task="fill-mask",
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="pt",
|
||||
top_k=2,
|
||||
)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp, valid_inputs, mandatory_keys, invalid_inputs, expected_check_keys=["sequence"]
|
||||
)
|
||||
|
||||
@require_tf
|
||||
def test_tf_fill_mask(self):
|
||||
mandatory_keys = {"sequence", "score", "token"}
|
||||
valid_inputs = [
|
||||
"My name is <mask>",
|
||||
"The largest city in France is <mask>",
|
||||
]
|
||||
invalid_inputs = [
|
||||
"This is <mask> <mask>" # More than 1 mask_token in the input is not supported
|
||||
"This is" # No mask_token is not supported
|
||||
]
|
||||
for model_name in FILL_MASK_FINETUNED_MODELS:
|
||||
nlp = pipeline(
|
||||
task="fill-mask",
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="tf",
|
||||
top_k=2,
|
||||
)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp, valid_inputs, mandatory_keys, invalid_inputs, expected_check_keys=["sequence"]
|
||||
)
|
||||
|
||||
@require_torch
|
||||
def test_torch_fill_mask_with_targets(self):
|
||||
valid_inputs = ["My name is <mask>"]
|
||||
valid_targets = [[" Teven", " Patrick", " Clara"], [" Sam"]]
|
||||
invalid_targets = [[], [""], ""]
|
||||
for model_name in FILL_MASK_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, framework="pt")
|
||||
for targets in valid_targets:
|
||||
outputs = nlp(valid_inputs, targets=targets)
|
||||
self.assertIsInstance(outputs, list)
|
||||
self.assertEqual(len(outputs), len(targets))
|
||||
for targets in invalid_targets:
|
||||
self.assertRaises(ValueError, nlp, valid_inputs, targets=targets)
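For reference, the call pattern exercised above written out directly; the checkpoint and targets are the ones used in the test, and the output shape is what the assertions imply rather than a documented guarantee:

nlp = pipeline(task="fill-mask", model="sshleifer/tiny-distilroberta-base", framework="pt")
outputs = nlp(["My name is <mask>"], targets=[" Patrick", " Clara"])
# -> a list with one candidate per target, each carrying "sequence", "score" and "token"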
|
||||
|
||||
@require_tf
|
||||
def test_tf_fill_mask_with_targets(self):
|
||||
valid_inputs = ["My name is <mask>"]
|
||||
valid_targets = [[" Teven", " Patrick", " Clara"], [" Sam"]]
|
||||
invalid_targets = [[], [""], ""]
|
||||
for model_name in FILL_MASK_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, framework="tf")
|
||||
for targets in valid_targets:
|
||||
outputs = nlp(valid_inputs, targets=targets)
|
||||
self.assertIsInstance(outputs, list)
|
||||
self.assertEqual(len(outputs), len(targets))
|
||||
for targets in invalid_targets:
|
||||
self.assertRaises(ValueError, nlp, valid_inputs, targets=targets)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_torch_fill_mask_results(self):
|
||||
mandatory_keys = {"sequence", "score", "token"}
|
||||
valid_inputs = [
|
||||
"My name is <mask>",
|
||||
"The largest city in France is <mask>",
|
||||
]
|
||||
valid_targets = [" Patrick", " Clara"]
|
||||
for model_name in LARGE_FILL_MASK_FINETUNED_MODELS:
|
||||
nlp = pipeline(
|
||||
task="fill-mask",
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="pt",
|
||||
top_k=2,
|
||||
)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp,
|
||||
valid_inputs,
|
||||
mandatory_keys,
|
||||
expected_multi_result=expected_fill_mask_result,
|
||||
expected_check_keys=["sequence"],
|
||||
)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp,
|
||||
valid_inputs[:1],
|
||||
mandatory_keys,
|
||||
expected_multi_result=expected_fill_mask_target_result,
|
||||
expected_check_keys=["sequence"],
|
||||
targets=valid_targets,
|
||||
)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_fill_mask_results(self):
|
||||
mandatory_keys = {"sequence", "score", "token"}
|
||||
valid_inputs = [
|
||||
"My name is <mask>",
|
||||
"The largest city in France is <mask>",
|
||||
]
|
||||
valid_targets = [" Patrick", " Clara"]
|
||||
for model_name in LARGE_FILL_MASK_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, framework="tf", top_k=2)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp,
|
||||
valid_inputs,
|
||||
mandatory_keys,
|
||||
expected_multi_result=expected_fill_mask_result,
|
||||
expected_check_keys=["sequence"],
|
||||
)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp,
|
||||
valid_inputs[:1],
|
||||
mandatory_keys,
|
||||
expected_multi_result=expected_fill_mask_target_result,
|
||||
expected_check_keys=["sequence"],
|
||||
targets=valid_targets,
|
||||
)
|
||||
|
||||
@require_torch
|
||||
@require_tokenizers
|
||||
def test_torch_summarization(self):
|
||||
invalid_inputs = [4, "<mask>"]
|
||||
mandatory_keys = ["summary_text"]
|
||||
for model in SUMMARIZATION_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="summarization", model=model, tokenizer=model)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp, VALID_INPUTS, mandatory_keys, invalid_inputs=invalid_inputs, **SUMMARIZATION_KWARGS
|
||||
)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_integration_torch_summarization(self):
|
||||
nlp = pipeline(task="summarization", device=DEFAULT_DEVICE_NUM)
|
||||
cnn_article = ' (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday\'s ceremony, said it was a move toward greater justice. "As Palestine formally becomes a State Party to the Rome Statute today, the world is also a step closer to ending a long era of impunity and injustice," he said, according to an ICC news release. "Indeed, today brings us closer to our shared goals of justice and peace." Judge Kuniko Ozaki, a vice president of the ICC, said acceding to the treaty was just the first step for the Palestinians. "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. These are substantive commitments, which cannot be taken lightly," she said. Rights group Human Rights Watch welcomed the development. "Governments seeking to penalize Palestine for joining the ICC should immediately end their pressure, and countries that support universal acceptance of the court\'s treaty should speak out to welcome its membership," said Balkees Jarrah, international justice counsel for the group. "What\'s objectionable is the attempts to undermine international justice, not Palestine\'s decision to join a treaty to which over 100 countries around the world are members." In January, when the preliminary ICC examination was opened, Israeli Prime Minister Benjamin Netanyahu described it as an outrage, saying the court was overstepping its boundaries. The United States also said it "strongly" disagreed with the court\'s decision. "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement. It urged the warring sides to resolve their differences through direct negotiations. "We will continue to oppose actions against Israel at the ICC as counterproductive to the cause of peace," it said. But the ICC begs to differ with the definition of a state for its purposes and refers to the territories as "Palestine." While a preliminary examination is not a formal investigation, it allows the court to review evidence and determine whether to investigate suspects on both sides. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." The war between Israel and Hamas militants in Gaza last summer left more than 2,000 people dead. The inquiry will include alleged war crimes committed since June. The International Criminal Court was set up in 2002 to prosecute genocide, crimes against humanity and war crimes. 
CNN\'s Vasco Cotovio, Kareem Khadder and Faith Karimi contributed to this report.'
|
||||
expected_cnn_summary = " The Palestinian Authority becomes the 123rd member of the International Criminal Court . The move gives the court jurisdiction over alleged crimes in Palestinian territories . Israel and the United States opposed the Palestinians' efforts to join the court . Rights group Human Rights Watch welcomes the move, says governments seeking to penalize Palestine should end pressure ."
|
||||
result = nlp(cnn_article)
|
||||
self.assertEqual(result[0]["summary_text"], expected_cnn_summary)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_summarization(self):
|
||||
invalid_inputs = [4, "<mask>"]
|
||||
mandatory_keys = ["summary_text"]
|
||||
for model_name in TF_SUMMARIZATION_FINETUNED_MODELS:
|
||||
nlp = pipeline(
|
||||
task="summarization",
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="tf",
|
||||
)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp, VALID_INPUTS, mandatory_keys, invalid_inputs=invalid_inputs, **SUMMARIZATION_KWARGS
|
||||
)
|
||||
|
||||
@require_torch
|
||||
@require_tokenizers
|
||||
@slow
|
||||
def test_torch_translation(self):
|
||||
invalid_inputs = [4, "<mask>"]
|
||||
mandatory_keys = ["translation_text"]
|
||||
for model_name, task in TRANSLATION_FINETUNED_MODELS:
|
||||
nlp = pipeline(task=task, model=model_name, tokenizer=model_name)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp,
|
||||
VALID_INPUTS,
|
||||
mandatory_keys,
|
||||
invalid_inputs,
|
||||
)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_default_translations(self):
|
||||
# We don't provide a default for this pair
|
||||
with self.assertRaises(ValueError):
|
||||
pipeline(task="translation_cn_to_ar")
|
||||
|
||||
# but we do for this one
|
||||
pipeline(task="translation_en_to_de")
|
||||
|
||||
@require_torch
|
||||
def test_translation_on_odd_language(self):
|
||||
model = TRANSLATION_FINETUNED_MODELS[0][0]
|
||||
pipeline(task="translation_cn_to_ar", model=model)
|
||||
|
||||
@require_torch
|
||||
def test_translation_default_language_selection(self):
|
||||
model = TRANSLATION_FINETUNED_MODELS[0][0]
|
||||
with pytest.warns(UserWarning, match=r".*translation_en_to_de.*"):
|
||||
nlp = pipeline(task="translation", model=model)
|
||||
self.assertEqual(nlp.task, "translation_en_to_de")
|
||||
|
||||
@require_torch
|
||||
def test_translation_with_no_language_no_model_fails(self):
|
||||
with self.assertRaises(ValueError):
|
||||
pipeline(task="translation")
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_translation(self):
|
||||
invalid_inputs = [4, "<mask>"]
|
||||
mandatory_keys = ["translation_text"]
|
||||
for model, task in TF_TRANSLATION_FINETUNED_MODELS:
|
||||
nlp = pipeline(task=task, model=model, tokenizer=model, framework="tf")
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, mandatory_keys, invalid_inputs=invalid_inputs)
|
||||
|
||||
@require_torch
|
||||
@require_tokenizers
|
||||
def test_torch_text2text(self):
|
||||
invalid_inputs = [4, "<mask>"]
|
||||
mandatory_keys = ["generated_text"]
|
||||
for model_name in TEXT2TEXT_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="text2text-generation", model=model_name, tokenizer=model_name)
|
||||
self._test_mono_column_pipeline(
|
||||
nlp,
|
||||
VALID_INPUTS,
|
||||
mandatory_keys,
|
||||
invalid_inputs,
|
||||
)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_text2text(self):
|
||||
invalid_inputs = [4, "<mask>"]
|
||||
mandatory_keys = ["generated_text"]
|
||||
for model in TEXT2TEXT_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="text2text-generation", model=model, tokenizer=model, framework="tf")
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, mandatory_keys, invalid_inputs=invalid_inputs)
|
||||
|
||||
@require_torch
|
||||
def test_torch_text_generation(self):
|
||||
for model_name in TEXT_GENERATION_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="text-generation", model=model_name, tokenizer=model_name, framework="pt")
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, {})
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, {}, prefix="This is ")
|
||||
|
||||
@require_tf
|
||||
def test_tf_text_generation(self):
|
||||
for model_name in TEXT_GENERATION_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="text-generation", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, {})
|
||||
self._test_mono_column_pipeline(nlp, VALID_INPUTS, {}, prefix="This is ")
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_integration_torch_conversation(self):
|
||||
# When
|
||||
nlp = pipeline(task="conversational", device=DEFAULT_DEVICE_NUM)
|
||||
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
|
||||
conversation_2 = Conversation("What's the last book you have read?")
|
||||
# Then
|
||||
self.assertEqual(len(conversation_1.past_user_inputs), 0)
|
||||
self.assertEqual(len(conversation_2.past_user_inputs), 0)
|
||||
# When
|
||||
result = nlp([conversation_1, conversation_2], do_sample=False, max_length=1000)
|
||||
# Then
|
||||
self.assertEqual(result, [conversation_1, conversation_2])
|
||||
self.assertEqual(len(result[0].past_user_inputs), 1)
|
||||
self.assertEqual(len(result[1].past_user_inputs), 1)
|
||||
self.assertEqual(len(result[0].generated_responses), 1)
|
||||
self.assertEqual(len(result[1].generated_responses), 1)
|
||||
self.assertEqual(result[0].past_user_inputs[0], "Going to the movies tonight - any suggestions?")
|
||||
self.assertEqual(result[0].generated_responses[0], "The Big Lebowski")
|
||||
self.assertEqual(result[1].past_user_inputs[0], "What's the last book you have read?")
|
||||
self.assertEqual(result[1].generated_responses[0], "The Last Question")
|
||||
# When
|
||||
conversation_2.add_user_input("Why do you recommend it?")
|
||||
result = nlp(conversation_2, do_sample=False, max_length=1000)
|
||||
# Then
|
||||
self.assertEqual(result, conversation_2)
|
||||
self.assertEqual(len(result.past_user_inputs), 2)
|
||||
self.assertEqual(len(result.generated_responses), 2)
|
||||
self.assertEqual(result.past_user_inputs[1], "Why do you recommend it?")
|
||||
self.assertEqual(result.generated_responses[1], "It's a good book.")
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_integration_torch_conversation_truncated_history(self):
|
||||
# When
|
||||
nlp = pipeline(task="conversational", min_length_for_response=24, device=DEFAULT_DEVICE_NUM)
|
||||
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
|
||||
# Then
|
||||
self.assertEqual(len(conversation_1.past_user_inputs), 0)
|
||||
# When
|
||||
result = nlp(conversation_1, do_sample=False, max_length=36)
|
||||
# Then
|
||||
self.assertEqual(result, conversation_1)
|
||||
self.assertEqual(len(result.past_user_inputs), 1)
|
||||
self.assertEqual(len(result.generated_responses), 1)
|
||||
self.assertEqual(result.past_user_inputs[0], "Going to the movies tonight - any suggestions?")
|
||||
self.assertEqual(result.generated_responses[0], "The Big Lebowski")
|
||||
# When
|
||||
conversation_1.add_user_input("Is it an action movie?")
|
||||
result = nlp(conversation_1, do_sample=False, max_length=36)
|
||||
# Then
|
||||
self.assertEqual(result, conversation_1)
|
||||
self.assertEqual(len(result.past_user_inputs), 2)
|
||||
self.assertEqual(len(result.generated_responses), 2)
|
||||
self.assertEqual(result.past_user_inputs[1], "Is it an action movie?")
|
||||
self.assertEqual(result.generated_responses[1], "It's a comedy.")
|
||||
|
||||
|
||||
QA_FINETUNED_MODELS = ["sshleifer/tiny-distilbert-base-cased-distilled-squad"]
|
||||
|
||||
|
||||
class ZeroShotClassificationPipelineTests(unittest.TestCase):
|
||||
def _test_scores_sum_to_one(self, result):
|
||||
sum = 0.0
|
||||
for score in result["scores"]:
|
||||
sum += score
|
||||
self.assertAlmostEqual(sum, 1.0)
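An equivalent check using the standard library, kept here as a hedged sketch of the invariant the helper asserts (scores over the candidate labels should form a probability distribution):

import math

def scores_sum_to_one(result):
    # Same invariant as the helper above, expressed with builtins.
    return math.isclose(sum(result["scores"]), 1.0, abs_tol=1e-6)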
|
||||
|
||||
def _test_zero_shot_pipeline(self, nlp):
|
||||
output_keys = {"sequence", "labels", "scores"}
|
||||
valid_mono_inputs = [
|
||||
{"sequences": "Who are you voting for in 2020?", "candidate_labels": "politics"},
|
||||
{"sequences": "Who are you voting for in 2020?", "candidate_labels": ["politics"]},
|
||||
{"sequences": "Who are you voting for in 2020?", "candidate_labels": "politics, public health"},
|
||||
{"sequences": "Who are you voting for in 2020?", "candidate_labels": ["politics", "public health"]},
|
||||
{"sequences": ["Who are you voting for in 2020?"], "candidate_labels": "politics"},
|
||||
{
|
||||
"sequences": "Who are you voting for in 2020?",
|
||||
"candidate_labels": "politics",
|
||||
"hypothesis_template": "This text is about {}",
|
||||
},
|
||||
]
|
||||
valid_multi_input = {
|
||||
"sequences": ["Who are you voting for in 2020?", "What is the capital of Spain?"],
|
||||
"candidate_labels": "politics",
|
||||
}
|
||||
invalid_inputs = [
|
||||
{"sequences": None, "candidate_labels": "politics"},
|
||||
{"sequences": "", "candidate_labels": "politics"},
|
||||
{"sequences": "Who are you voting for in 2020?", "candidate_labels": None},
|
||||
{"sequences": "Who are you voting for in 2020?", "candidate_labels": ""},
|
||||
{
|
||||
"sequences": "Who are you voting for in 2020?",
|
||||
"candidate_labels": "politics",
|
||||
"hypothesis_template": None,
|
||||
},
|
||||
{
|
||||
"sequences": "Who are you voting for in 2020?",
|
||||
"candidate_labels": "politics",
|
||||
"hypothesis_template": "",
|
||||
},
|
||||
{
|
||||
"sequences": "Who are you voting for in 2020?",
|
||||
"candidate_labels": "politics",
|
||||
"hypothesis_template": "Template without formatting syntax.",
|
||||
},
|
||||
]
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
for mono_input in valid_mono_inputs:
|
||||
mono_result = nlp(**mono_input)
|
||||
self.assertIsInstance(mono_result, dict)
|
||||
if len(mono_result["labels"]) > 1:
|
||||
self._test_scores_sum_to_one(mono_result)
|
||||
|
||||
for key in output_keys:
|
||||
self.assertIn(key, mono_result)
|
||||
|
||||
multi_result = nlp(**valid_multi_input)
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], dict)
|
||||
self.assertEqual(len(multi_result), len(valid_multi_input["sequences"]))
|
||||
|
||||
for result in multi_result:
|
||||
for key in output_keys:
|
||||
self.assertIn(key, result)
|
||||
|
||||
if len(result["labels"]) > 1:
|
||||
self._test_scores_sum_to_one(result)
|
||||
|
||||
for bad_input in invalid_inputs:
|
||||
self.assertRaises(Exception, nlp, **bad_input)
|
||||
|
||||
def _test_zero_shot_pipeline_outputs(self, nlp):
|
||||
inputs = [
|
||||
{
|
||||
"sequences": "Who are you voting for in 2020?",
|
||||
"candidate_labels": ["politics", "public health", "science"],
|
||||
},
|
||||
{
|
||||
"sequences": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
|
||||
"candidate_labels": ["machine learning", "statistics", "translation", "vision"],
|
||||
"multi_class": True,
|
||||
},
|
||||
]
|
||||
|
||||
expected_outputs = [
|
||||
{
|
||||
"sequence": "Who are you voting for in 2020?",
|
||||
"labels": ["politics", "public health", "science"],
|
||||
"scores": [0.975, 0.015, 0.008],
|
||||
},
|
||||
{
|
||||
"sequence": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
|
||||
"labels": ["translation", "machine learning", "vision", "statistics"],
|
||||
"scores": [0.817, 0.712, 0.018, 0.017],
|
||||
},
|
||||
]
|
||||
|
||||
for input, expected_output in zip(inputs, expected_outputs):
|
||||
output = nlp(**input)
|
||||
for key in output:
|
||||
if key == "scores":
|
||||
for output_score, expected_score in zip(output[key], expected_output[key]):
|
||||
self.assertAlmostEqual(output_score, expected_score, places=2)
|
||||
else:
|
||||
self.assertEqual(output[key], expected_output[key])
|
||||
|
||||
@require_torch
|
||||
def test_torch_zero_shot_classification(self):
|
||||
for model_name in TEXT_CLASSIF_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="zero-shot-classification", model=model_name, tokenizer=model_name)
|
||||
self._test_zero_shot_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
def test_tf_zero_shot_classification(self):
|
||||
for model_name in TEXT_CLASSIF_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="zero-shot-classification", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_zero_shot_pipeline(nlp)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_torch_zero_shot_outputs(self):
|
||||
nlp = pipeline(task="zero-shot-classification", model="roberta-large-mnli")
|
||||
self._test_zero_shot_pipeline_outputs(nlp)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_zero_shot_outputs(self):
|
||||
nlp = pipeline(task="zero-shot-classification", model="roberta-large-mnli", framework="tf")
|
||||
self._test_zero_shot_pipeline_outputs(nlp)
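A direct call showing the output layout these checks rely on; the model and inputs are the ones from `_test_zero_shot_pipeline_outputs`, and the comment summarises what the expected values imply:

classifier = pipeline(task="zero-shot-classification", model="roberta-large-mnli")
out = classifier(sequences="Who are you voting for in 2020?", candidate_labels=["politics", "public health", "science"])
# out["labels"] comes back sorted by descending score; without multi_class=True the scores sum to ~1.0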
|
||||
|
||||
|
||||
class DialoguePipelineTests(unittest.TestCase):
|
||||
def _test_conversation_pipeline(self, nlp):
|
||||
valid_inputs = [Conversation("Hi there!"), [Conversation("Hi there!"), Conversation("How are you?")]]
|
||||
invalid_inputs = ["Hi there!", Conversation()]
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(valid_inputs[0])
|
||||
self.assertIsInstance(mono_result, Conversation)
|
||||
|
||||
multi_result = nlp(valid_inputs[1])
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], Conversation)
|
||||
# Inactive conversations passed to the pipeline raise a ValueError
|
||||
self.assertRaises(ValueError, nlp, valid_inputs[1])
|
||||
|
||||
for bad_input in invalid_inputs:
|
||||
self.assertRaises(Exception, nlp, bad_input)
|
||||
self.assertRaises(Exception, nlp, invalid_inputs)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_torch_conversation(self):
|
||||
for model_name in DIALOGUE_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="conversational", model=model_name, tokenizer=model_name)
|
||||
self._test_conversation_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_conversation(self):
|
||||
for model_name in DIALOGUE_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="conversational", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_conversation_pipeline(nlp)
|
||||
|
||||
|
||||
class QAPipelineTests(unittest.TestCase):
|
||||
def _test_qa_pipeline(self, nlp):
|
||||
output_keys = {"score", "answer", "start", "end"}
|
||||
valid_inputs = [
|
||||
{"question": "Where was HuggingFace founded ?", "context": "HuggingFace was founded in Paris."},
|
||||
{
|
||||
"question": "In what field is HuggingFace working ?",
|
||||
"context": "HuggingFace is a startup based in New-York founded in Paris which is trying to solve NLP.",
|
||||
},
|
||||
]
|
||||
invalid_inputs = [
|
||||
{"question": "", "context": "This is a test to try empty question edge case"},
|
||||
{"question": None, "context": "This is a test to try empty question edge case"},
|
||||
{"question": "What is does with empty context ?", "context": ""},
|
||||
{"question": "What is does with empty context ?", "context": None},
|
||||
]
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(valid_inputs[0])
|
||||
self.assertIsInstance(mono_result, dict)
|
||||
|
||||
for key in output_keys:
|
||||
self.assertIn(key, mono_result)
|
||||
|
||||
multi_result = nlp(valid_inputs)
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], dict)
|
||||
|
||||
for result in multi_result:
|
||||
for key in output_keys:
|
||||
self.assertIn(key, result)
|
||||
for bad_input in invalid_inputs:
|
||||
self.assertRaises(Exception, nlp, bad_input)
|
||||
self.assertRaises(Exception, nlp, invalid_inputs)
|
||||
|
||||
@require_torch
|
||||
def test_torch_question_answering(self):
|
||||
for model_name in QA_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="question-answering", model=model_name, tokenizer=model_name)
|
||||
self._test_qa_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
def test_tf_question_answering(self):
|
||||
for model_name in QA_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="question-answering", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_qa_pipeline(nlp)
|
||||
|
||||
|
||||
class NerPipelineTests(unittest.TestCase):
|
||||
def _test_ner_pipeline(
|
||||
self,
|
||||
nlp: Pipeline,
|
||||
output_keys: Iterable[str],
|
||||
):
|
||||
|
||||
ungrouped_ner_inputs = [
|
||||
[
|
||||
{"entity": "B-PER", "index": 1, "score": 0.9994944930076599, "word": "Cons"},
|
||||
{"entity": "B-PER", "index": 2, "score": 0.8025449514389038, "word": "##uelo"},
|
||||
{"entity": "I-PER", "index": 3, "score": 0.9993102550506592, "word": "Ara"},
|
||||
{"entity": "I-PER", "index": 4, "score": 0.9993743896484375, "word": "##új"},
|
||||
{"entity": "I-PER", "index": 5, "score": 0.9992871880531311, "word": "##o"},
|
||||
{"entity": "I-PER", "index": 6, "score": 0.9993029236793518, "word": "No"},
|
||||
{"entity": "I-PER", "index": 7, "score": 0.9981776475906372, "word": "##guera"},
|
||||
{"entity": "B-PER", "index": 15, "score": 0.9998136162757874, "word": "Andrés"},
|
||||
{"entity": "I-PER", "index": 16, "score": 0.999740719795227, "word": "Pas"},
|
||||
{"entity": "I-PER", "index": 17, "score": 0.9997414350509644, "word": "##tran"},
|
||||
{"entity": "I-PER", "index": 18, "score": 0.9996136426925659, "word": "##a"},
|
||||
{"entity": "B-ORG", "index": 28, "score": 0.9989739060401917, "word": "Far"},
|
||||
{"entity": "I-ORG", "index": 29, "score": 0.7188422083854675, "word": "##c"},
|
||||
],
|
||||
[
|
||||
{"entity": "I-PER", "index": 1, "score": 0.9968166351318359, "word": "En"},
|
||||
{"entity": "I-PER", "index": 2, "score": 0.9957635998725891, "word": "##zo"},
|
||||
{"entity": "I-ORG", "index": 7, "score": 0.9986497163772583, "word": "UN"},
|
||||
],
|
||||
]
|
||||
expected_grouped_ner_results = [
|
||||
[
|
||||
{"entity_group": "B-PER", "score": 0.9710702640669686, "word": "Consuelo Araújo Noguera"},
|
||||
{"entity_group": "B-PER", "score": 0.9997273534536362, "word": "Andrés Pastrana"},
|
||||
{"entity_group": "B-ORG", "score": 0.8589080572128296, "word": "Farc"},
|
||||
],
|
||||
[
|
||||
{"entity_group": "I-PER", "score": 0.9962901175022125, "word": "Enzo"},
|
||||
{"entity_group": "I-ORG", "score": 0.9986497163772583, "word": "UN"},
|
||||
],
|
||||
]
|
||||
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(VALID_INPUTS[0])
|
||||
self.assertIsInstance(mono_result, list)
|
||||
self.assertIsInstance(mono_result[0], (dict, list))
|
||||
|
||||
if isinstance(mono_result[0], list):
|
||||
mono_result = mono_result[0]
|
||||
|
||||
for key in output_keys:
|
||||
self.assertIn(key, mono_result[0])
|
||||
|
||||
multi_result = [nlp(input) for input in VALID_INPUTS]
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], (dict, list))
|
||||
|
||||
if isinstance(multi_result[0], list):
|
||||
multi_result = multi_result[0]
|
||||
|
||||
for result in multi_result:
|
||||
for key in output_keys:
|
||||
self.assertIn(key, result)
|
||||
|
||||
for ungrouped_input, grouped_result in zip(ungrouped_ner_inputs, expected_grouped_ner_results):
|
||||
self.assertEqual(nlp.group_entities(ungrouped_input), grouped_result)
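What `group_entities` does to the raw token-level predictions above, in sketch form; the merged span and score are taken from the first expected result (the grouped score is the average of the member token scores):

ner = pipeline(task="ner", model=NER_FINETUNED_MODELS[0], grouped_entities=True)
# Consecutive sub-word predictions ("Cons", "##uelo", "Ara", "##új", "##o", ...) are merged into
# a single span, e.g. {"entity_group": "B-PER", "score": 0.971..., "word": "Consuelo Araújo Noguera"}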
|
||||
|
||||
@require_torch
|
||||
def test_torch_ner(self):
|
||||
mandatory_keys = {"entity", "word", "score"}
|
||||
for model_name in NER_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name)
|
||||
self._test_ner_pipeline(nlp, mandatory_keys)
|
||||
|
||||
@require_torch
|
||||
def test_ner_grouped(self):
|
||||
mandatory_keys = {"entity_group", "word", "score"}
|
||||
for model_name in NER_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name, grouped_entities=True)
|
||||
self._test_ner_pipeline(nlp, mandatory_keys)
|
||||
|
||||
@require_tf
|
||||
def test_tf_ner(self):
|
||||
mandatory_keys = {"entity", "word", "score"}
|
||||
for model_name in NER_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_ner_pipeline(nlp, mandatory_keys)
|
||||
|
||||
@require_tf
|
||||
def test_tf_ner_grouped(self):
|
||||
mandatory_keys = {"entity_group", "word", "score"}
|
||||
for model_name in NER_FINETUNED_MODELS:
|
||||
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name, framework="tf", grouped_entities=True)
|
||||
self._test_ner_pipeline(nlp, mandatory_keys)
|
||||
|
||||
@require_tf
|
||||
def test_tf_only_ner(self):
|
||||
mandatory_keys = {"entity", "word", "score"}
|
||||
for model_name in TF_NER_FINETUNED_MODELS:
|
||||
# We don't specify framework='tf', but it is detected automatically
|
||||
nlp = pipeline(task="ner", model=model_name, tokenizer=model_name)
|
||||
self._test_ner_pipeline(nlp, mandatory_keys)
|
||||
|
||||
|
||||
class PipelineCommonTests(unittest.TestCase):
|
||||
pipelines = SUPPORTED_TASKS.keys()
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_defaults(self):
|
||||
# Test that pipelines can be correctly loaded without any argument
|
||||
for task in self.pipelines:
|
||||
with self.subTest(msg="Testing TF defaults with TF and {}".format(task)):
|
||||
pipeline(task, framework="tf")
|
||||
pipeline(task)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_pt_defaults(self):
|
||||
# Test that pipelines can be correctly loaded without any argument
|
||||
for task in self.pipelines:
|
||||
with self.subTest(msg="Testing Torch defaults with PyTorch and {}".format(task)):
|
||||
pipeline(task, framework="pt")
|
||||
pipeline(task)
|
|
@ -0,0 +1,273 @@
import unittest
|
||||
from typing import List, Optional
|
||||
|
||||
from transformers import is_tf_available, is_torch_available, pipeline
|
||||
from transformers.pipelines import DefaultArgumentHandler, Pipeline
|
||||
from transformers.testing_utils import _run_slow_tests, is_pipeline_test, require_tf, require_torch, slow
|
||||
|
||||
|
||||
VALID_INPUTS = ["A simple string", ["list of strings"]]
|
||||
|
||||
|
||||
@is_pipeline_test
|
||||
class CustomInputPipelineCommonMixin:
|
||||
pipeline_task = None
|
||||
pipeline_loading_kwargs = {}
|
||||
small_models = None # Models tested without the @slow decorator
|
||||
large_models = None # Models tested with the @slow decorator
|
||||
|
||||
def setUp(self) -> None:
|
||||
if not is_tf_available() and not is_torch_available():
|
||||
return # Currently no JAX pipelines
|
||||
|
||||
# Download needed checkpoints
|
||||
models = self.small_models
|
||||
if _run_slow_tests:
|
||||
models = models + self.large_models
|
||||
|
||||
for model_name in models:
|
||||
if is_torch_available():
|
||||
pipeline(
|
||||
self.pipeline_task,
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="pt",
|
||||
**self.pipeline_loading_kwargs,
|
||||
)
|
||||
if is_tf_available():
|
||||
pipeline(
|
||||
self.pipeline_task,
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="tf",
|
||||
**self.pipeline_loading_kwargs,
|
||||
)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_pt_defaults(self):
|
||||
pipeline(self.pipeline_task, framework="pt")
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_defaults(self):
|
||||
pipeline(self.pipeline_task, framework="tf")
|
||||
|
||||
@require_torch
|
||||
def test_torch_small(self):
|
||||
for model_name in self.small_models:
|
||||
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="pt")
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
def test_tf_small(self):
|
||||
for model_name in self.small_models:
|
||||
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_torch_large(self):
|
||||
for model_name in self.large_models:
|
||||
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="pt")
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_large(self):
|
||||
for model_name in self.large_models:
|
||||
nlp = pipeline(task=self.pipeline_task, model=model_name, tokenizer=model_name, framework="tf")
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
def _test_pipeline(self, nlp: Pipeline):
|
||||
raise NotImplementedError
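Tasks with structured inputs are expected to subclass this mixin, point it at their checkpoints and implement the `_test_pipeline` hook; a minimal sketch, reusing a tiny QA checkpoint that appears elsewhere in this PR (imports assumed from the top of this module):

class MyQAPipelineTests(CustomInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "question-answering"
    small_models = ["sshleifer/tiny-distilbert-base-cased-distilled-squad"]  # tested without @slow
    large_models = []  # would run only under @slow

    def _test_pipeline(self, nlp: Pipeline):
        result = nlp({"question": "Where was HuggingFace founded ?", "context": "HuggingFace was founded in Paris."})
        self.assertIsInstance(result, dict)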
|
||||
|
||||
|
||||
@is_pipeline_test
|
||||
class MonoInputPipelineCommonMixin:
|
||||
pipeline_task = None
|
||||
pipeline_loading_kwargs = {} # Additional kwargs to load the pipeline with
|
||||
pipeline_running_kwargs = {} # Additional kwargs to run the pipeline with
|
||||
small_models = [] # Models tested without the @slow decorator
|
||||
large_models = [] # Models tested with the @slow decorator
|
||||
mandatory_keys = {} # Keys which should be in the output
|
||||
valid_inputs = VALID_INPUTS # inputs which are valid
|
||||
invalid_inputs = [None] # inputs which are not allowed
|
||||
expected_multi_result: Optional[List] = None
|
||||
expected_check_keys: Optional[List[str]] = None
|
||||
|
||||
def setUp(self) -> None:
|
||||
if not is_tf_available() and not is_torch_available():
|
||||
return # Currently no JAX pipelines
|
||||
|
||||
for model_name in self.small_models:
|
||||
pipeline(self.pipeline_task, model=model_name, tokenizer=model_name, **self.pipeline_loading_kwargs)
|
||||
for model_name in self.large_models:
|
||||
pipeline(self.pipeline_task, model=model_name, tokenizer=model_name, **self.pipeline_loading_kwargs)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_pt_defaults_loads(self):
|
||||
pipeline(self.pipeline_task, framework="pt", **self.pipeline_loading_kwargs)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_defaults_loads(self):
|
||||
pipeline(self.pipeline_task, framework="tf", **self.pipeline_loading_kwargs)
|
||||
|
||||
@require_torch
|
||||
def test_torch_small(self):
|
||||
for model_name in self.small_models:
|
||||
nlp = pipeline(
|
||||
task=self.pipeline_task,
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="pt",
|
||||
**self.pipeline_loading_kwargs,
|
||||
)
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
def test_tf_small(self):
|
||||
for model_name in self.small_models:
|
||||
nlp = pipeline(
|
||||
task=self.pipeline_task,
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="tf",
|
||||
**self.pipeline_loading_kwargs,
|
||||
)
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_torch_large(self):
|
||||
for model_name in self.large_models:
|
||||
nlp = pipeline(
|
||||
task=self.pipeline_task,
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="pt",
|
||||
**self.pipeline_loading_kwargs,
|
||||
)
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
@require_tf
|
||||
@slow
|
||||
def test_tf_large(self):
|
||||
for model_name in self.large_models:
|
||||
nlp = pipeline(
|
||||
task=self.pipeline_task,
|
||||
model=model_name,
|
||||
tokenizer=model_name,
|
||||
framework="tf",
|
||||
**self.pipeline_loading_kwargs,
|
||||
)
|
||||
self._test_pipeline(nlp)
|
||||
|
||||
def _test_pipeline(self, nlp: Pipeline):
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(self.valid_inputs[0], **self.pipeline_running_kwargs)
|
||||
self.assertIsInstance(mono_result, list)
|
||||
self.assertIsInstance(mono_result[0], (dict, list))
|
||||
|
||||
if isinstance(mono_result[0], list):
|
||||
mono_result = mono_result[0]
|
||||
|
||||
for key in self.mandatory_keys:
|
||||
self.assertIn(key, mono_result[0])
|
||||
|
||||
multi_result = [nlp(input, **self.pipeline_running_kwargs) for input in self.valid_inputs]
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], (dict, list))
|
||||
|
||||
if self.expected_multi_result is not None:
|
||||
for result, expect in zip(multi_result, self.expected_multi_result):
|
||||
for key in self.expected_check_keys or []:
|
||||
self.assertEqual(
|
||||
set([o[key] for o in result]),
|
||||
set([o[key] for o in expect]),
|
||||
)
|
||||
|
||||
if isinstance(multi_result[0], list):
|
||||
multi_result = multi_result[0]
|
||||
|
||||
for result in multi_result:
|
||||
for key in self.mandatory_keys:
|
||||
self.assertIn(key, result)
|
||||
|
||||
self.assertRaises(Exception, nlp, self.invalid_inputs)
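Single-text tasks, by contrast, usually only need to fill in class attributes and can rely on the default `_test_pipeline` above; a minimal sketch mirroring the feature-extraction tests added later in this PR:

class TinyFeatureExtractionTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "feature-extraction"
    small_models = ["sshleifer/tiny-distilbert-base-cased"]  # tested without @slow
    large_models = []
    mandatory_keys = {}  # feature vectors carry no named output keys to check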
|
||||
|
||||
|
||||
@is_pipeline_test
|
||||
class DefaultArgumentHandlerTestCase(unittest.TestCase):
|
||||
def setUp(self) -> None:
|
||||
self.handler = DefaultArgumentHandler()
|
||||
|
||||
def test_kwargs_x(self):
|
||||
mono_data = {"X": "This is a sample input"}
|
||||
mono_args = self.handler(**mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
multi_data = {"x": ["This is a sample input", "This is a second sample input"]}
|
||||
multi_args = self.handler(**multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
||||
|
||||
def test_kwargs_data(self):
|
||||
mono_data = {"data": "This is a sample input"}
|
||||
mono_args = self.handler(**mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
multi_data = {"data": ["This is a sample input", "This is a second sample input"]}
|
||||
multi_args = self.handler(**multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
||||
|
||||
def test_multi_kwargs(self):
|
||||
mono_data = {"data": "This is a sample input", "X": "This is a sample input 2"}
|
||||
mono_args = self.handler(**mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 2)
|
||||
|
||||
multi_data = {
|
||||
"data": ["This is a sample input", "This is a second sample input"],
|
||||
"test": ["This is a sample input 2", "This is a second sample input 2"],
|
||||
}
|
||||
multi_args = self.handler(**multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 4)
|
||||
|
||||
def test_args(self):
|
||||
mono_data = "This is a sample input"
|
||||
mono_args = self.handler(mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
mono_data = ["This is a sample input"]
|
||||
mono_args = self.handler(mono_data)
|
||||
|
||||
self.assertTrue(isinstance(mono_args, list))
|
||||
self.assertEqual(len(mono_args), 1)
|
||||
|
||||
multi_data = ["This is a sample input", "This is a second sample input"]
|
||||
multi_args = self.handler(multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
||||
|
||||
multi_data = ["This is a sample input", "This is a second sample input"]
|
||||
multi_args = self.handler(*multi_data)
|
||||
|
||||
self.assertTrue(isinstance(multi_args, list))
|
||||
self.assertEqual(len(multi_args), 2)
|
|
@ -0,0 +1,93 @@
import unittest
|
||||
|
||||
from transformers import Conversation, pipeline
|
||||
from transformers.testing_utils import require_torch, slow, torch_device
|
||||
|
||||
from .test_pipelines_common import MonoInputPipelineCommonMixin
|
||||
|
||||
|
||||
DEFAULT_DEVICE_NUM = -1 if torch_device == "cpu" else 0
|
||||
|
||||
|
||||
class TextGenerationPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
|
||||
pipeline_task = "conversational"
|
||||
small_models = [] # Models tested without the @slow decorator
|
||||
large_models = ["microsoft/DialoGPT-medium"] # Models tested with the @slow decorator
|
||||
valid_inputs = [Conversation("Hi there!"), [Conversation("Hi there!"), Conversation("How are you?")]]
|
||||
invalid_inputs = ["Hi there!", Conversation()]
|
||||
|
||||
def _test_pipeline(
|
||||
self, nlp
|
||||
): # We override the default test method to check that the output is a `Conversation` object
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(self.valid_inputs[0])
|
||||
self.assertIsInstance(mono_result, Conversation)
|
||||
|
||||
multi_result = nlp(self.valid_inputs[1])
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], Conversation)
|
||||
# Inactive conversations passed to the pipeline raise a ValueError
|
||||
self.assertRaises(ValueError, nlp, self.valid_inputs[1])
|
||||
|
||||
for bad_input in self.invalid_inputs:
|
||||
self.assertRaises(Exception, nlp, bad_input)
|
||||
self.assertRaises(Exception, nlp, self.invalid_inputs)
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_integration_torch_conversation(self):
|
||||
# When
|
||||
nlp = pipeline(task="conversational", device=DEFAULT_DEVICE_NUM)
|
||||
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
|
||||
conversation_2 = Conversation("What's the last book you have read?")
|
||||
# Then
|
||||
self.assertEqual(len(conversation_1.past_user_inputs), 0)
|
||||
self.assertEqual(len(conversation_2.past_user_inputs), 0)
|
||||
# When
|
||||
result = nlp([conversation_1, conversation_2], do_sample=False, max_length=1000)
|
||||
# Then
|
||||
self.assertEqual(result, [conversation_1, conversation_2])
|
||||
self.assertEqual(len(result[0].past_user_inputs), 1)
|
||||
self.assertEqual(len(result[1].past_user_inputs), 1)
|
||||
self.assertEqual(len(result[0].generated_responses), 1)
|
||||
self.assertEqual(len(result[1].generated_responses), 1)
|
||||
self.assertEqual(result[0].past_user_inputs[0], "Going to the movies tonight - any suggestions?")
|
||||
self.assertEqual(result[0].generated_responses[0], "The Big Lebowski")
|
||||
self.assertEqual(result[1].past_user_inputs[0], "What's the last book you have read?")
|
||||
self.assertEqual(result[1].generated_responses[0], "The Last Question")
|
||||
# When
|
||||
conversation_2.add_user_input("Why do you recommend it?")
|
||||
result = nlp(conversation_2, do_sample=False, max_length=1000)
|
||||
# Then
|
||||
self.assertEqual(result, conversation_2)
|
||||
self.assertEqual(len(result.past_user_inputs), 2)
|
||||
self.assertEqual(len(result.generated_responses), 2)
|
||||
self.assertEqual(result.past_user_inputs[1], "Why do you recommend it?")
|
||||
self.assertEqual(result.generated_responses[1], "It's a good book.")
|
||||
|
||||
@require_torch
|
||||
@slow
|
||||
def test_integration_torch_conversation_truncated_history(self):
|
||||
# When
|
||||
nlp = pipeline(task="conversational", min_length_for_response=24, device=DEFAULT_DEVICE_NUM)
|
||||
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
|
||||
# Then
|
||||
self.assertEqual(len(conversation_1.past_user_inputs), 0)
|
||||
# When
|
||||
result = nlp(conversation_1, do_sample=False, max_length=36)
|
||||
# Then
|
||||
self.assertEqual(result, conversation_1)
|
||||
self.assertEqual(len(result.past_user_inputs), 1)
|
||||
self.assertEqual(len(result.generated_responses), 1)
|
||||
self.assertEqual(result.past_user_inputs[0], "Going to the movies tonight - any suggestions?")
|
||||
self.assertEqual(result.generated_responses[0], "The Big Lebowski")
|
||||
# When
|
||||
conversation_1.add_user_input("Is it an action movie?")
|
||||
result = nlp(conversation_1, do_sample=False, max_length=36)
|
||||
# Then
|
||||
self.assertEqual(result, conversation_1)
|
||||
self.assertEqual(len(result.past_user_inputs), 2)
|
||||
self.assertEqual(len(result.generated_responses), 2)
|
||||
self.assertEqual(result.past_user_inputs[1], "Is it an action movie?")
|
||||
self.assertEqual(result.generated_responses[1], "It's a comedy.")
|
|
@ -0,0 +1,29 @@
import unittest
|
||||
|
||||
from transformers.pipelines import Conversation, Pipeline
|
||||
|
||||
from .test_pipelines_common import CustomInputPipelineCommonMixin
|
||||
|
||||
|
||||
class DialoguePipelineTests(CustomInputPipelineCommonMixin, unittest.TestCase):
|
||||
pipeline_task = "conversational"
|
||||
small_models = [] # Models tested without the @slow decorator
|
||||
large_models = ["microsoft/DialoGPT-medium"] # Models tested with the @slow decorator
|
||||
|
||||
def _test_pipeline(self, nlp: Pipeline):
|
||||
valid_inputs = [Conversation("Hi there!"), [Conversation("Hi there!"), Conversation("How are you?")]]
|
||||
invalid_inputs = ["Hi there!", Conversation()]
|
||||
self.assertIsNotNone(nlp)
|
||||
|
||||
mono_result = nlp(valid_inputs[0])
|
||||
self.assertIsInstance(mono_result, Conversation)
|
||||
|
||||
multi_result = nlp(valid_inputs[1])
|
||||
self.assertIsInstance(multi_result, list)
|
||||
self.assertIsInstance(multi_result[0], Conversation)
|
||||
# Inactive conversations passed to the pipeline raise a ValueError
|
||||
self.assertRaises(ValueError, nlp, valid_inputs[1])
|
||||
|
||||
for bad_input in invalid_inputs:
|
||||
self.assertRaises(Exception, nlp, bad_input)
|
||||
self.assertRaises(Exception, nlp, invalid_inputs)
|
|
@ -0,0 +1,12 @@
import unittest
|
||||
|
||||
from .test_pipelines_common import MonoInputPipelineCommonMixin
|
||||
|
||||
|
||||
class FeatureExtractionPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
|
||||
pipeline_task = "feature-extraction"
|
||||
small_models = [
|
||||
"sshleifer/tiny-distilbert-base-cased"
|
||||
] # Default model - Models tested without the @slow decorator
|
||||
large_models = [None] # Models tested with the @slow decorator
|
||||
mandatory_keys = {} # Keys which should be in the output
|
|
@ -0,0 +1,140 @@
import unittest
|
||||
|
||||
from transformers import pipeline
|
||||
from transformers.testing_utils import require_tf, require_torch, slow
|
||||
|
||||
from .test_pipelines_common import MonoInputPipelineCommonMixin
|
||||
|
||||
|
||||
EXPECTED_FILL_MASK_RESULT = [
|
||||
[
|
||||
{"sequence": "<s>My name is John</s>", "score": 0.00782308354973793, "token": 610, "token_str": "ĠJohn"},
|
||||
{"sequence": "<s>My name is Chris</s>", "score": 0.007475061342120171, "token": 1573, "token_str": "ĠChris"},
|
||||
],
|
||||
[
|
||||
{"sequence": "<s>The largest city in France is Paris</s>", "score": 0.3185044229030609, "token": 2201},
|
||||
{"sequence": "<s>The largest city in France is Lyon</s>", "score": 0.21112334728240967, "token": 12790},
|
||||
],
|
||||
]
|
||||
|
||||
EXPECTED_FILL_MASK_TARGET_RESULT = [
|
||||
[
|
||||
{
|
||||
"sequence": "<s>My name is Patrick</s>",
|
||||
"score": 0.004992353264242411,
|
||||
"token": 3499,
|
||||
"token_str": "ĠPatrick",
|
||||
},
|
||||
{
|
||||
"sequence": "<s>My name is Clara</s>",
|
||||
"score": 0.00019297805556561798,
|
||||
"token": 13606,
|
||||
"token_str": "ĠClara",
|
||||
},
|
||||
]
|
||||
]
|
||||
|
||||
|
||||
class FillMaskPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
|
||||
pipeline_task = "fill-mask"
|
||||
pipeline_loading_kwargs = {"topk": 2}
|
||||
small_models = ["sshleifer/tiny-distilroberta-base"] # Models tested without the @slow decorator
|
||||
large_models = ["distilroberta-base"] # Models tested with the @slow decorator
|
||||
mandatory_keys = {"sequence", "score", "token"}
|
||||
valid_inputs = [
|
||||
"My name is <mask>",
|
||||
"The largest city in France is <mask>",
|
||||
]
|
||||
invalid_inputs = [
|
||||
"This is <mask> <mask>" # More than 1 mask_token in the input is not supported
|
||||
"This is" # No mask_token is not supported
|
||||
]
|
||||
expected_check_keys = ["sequence"]
|
||||
|
||||
    @require_torch
    def test_torch_fill_mask_with_targets(self):
        valid_inputs = ["My name is <mask>"]
        valid_targets = [[" Teven", " Patrick", " Clara"], [" Sam"]]
        invalid_targets = [[], [""], ""]
        for model_name in self.small_models:
            nlp = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, framework="pt")
            for targets in valid_targets:
                outputs = nlp(valid_inputs, targets=targets)
                self.assertIsInstance(outputs, list)
                self.assertEqual(len(outputs), len(targets))
            for targets in invalid_targets:
                self.assertRaises(ValueError, nlp, valid_inputs, targets=targets)

    @require_tf
    def test_tf_fill_mask_with_targets(self):
        valid_inputs = ["My name is <mask>"]
        valid_targets = [[" Teven", " Patrick", " Clara"], [" Sam"]]
        invalid_targets = [[], [""], ""]
        for model_name in self.small_models:
            nlp = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, framework="tf")
            for targets in valid_targets:
                outputs = nlp(valid_inputs, targets=targets)
                self.assertIsInstance(outputs, list)
                self.assertEqual(len(outputs), len(targets))
            for targets in invalid_targets:
                self.assertRaises(ValueError, nlp, valid_inputs, targets=targets)

    @require_torch
    @slow
    def test_torch_fill_mask_results(self):
        mandatory_keys = {"sequence", "score", "token"}
        valid_inputs = [
            "My name is <mask>",
            "The largest city in France is <mask>",
        ]
        valid_targets = [" Patrick", " Clara"]
        for model_name in self.large_models:
            nlp = pipeline(
                task="fill-mask",
                model=model_name,
                tokenizer=model_name,
                framework="pt",
                topk=2,
            )
            self._test_mono_column_pipeline(
                nlp,
                valid_inputs,
                mandatory_keys,
                expected_multi_result=EXPECTED_FILL_MASK_RESULT,
                expected_check_keys=["sequence"],
            )
            self._test_mono_column_pipeline(
                nlp,
                valid_inputs[:1],
                mandatory_keys,
                expected_multi_result=EXPECTED_FILL_MASK_TARGET_RESULT,
                expected_check_keys=["sequence"],
                targets=valid_targets,
            )

    @require_tf
    @slow
    def test_tf_fill_mask_results(self):
        mandatory_keys = {"sequence", "score", "token"}
        valid_inputs = [
            "My name is <mask>",
            "The largest city in France is <mask>",
        ]
        valid_targets = [" Patrick", " Clara"]
        for model_name in self.large_models:
            nlp = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, framework="tf", topk=2)
            self._test_mono_column_pipeline(
                nlp,
                valid_inputs,
                mandatory_keys,
                expected_multi_result=EXPECTED_FILL_MASK_RESULT,
                expected_check_keys=["sequence"],
            )
            self._test_mono_column_pipeline(
                nlp,
                valid_inputs[:1],
                mandatory_keys,
                expected_multi_result=EXPECTED_FILL_MASK_TARGET_RESULT,
                expected_check_keys=["sequence"],
                targets=valid_targets,
            )
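For reference, the call patterns exercised above translate directly into user-facing usage; a short sketch with the large checkpoint named in large_models (output values depend on the exact model):

from transformers import pipeline

fill_mask = pipeline(task="fill-mask", model="distilroberta-base", topk=2)

# Unrestricted: the two highest-scoring candidates for the mask.
print(fill_mask("The largest city in France is <mask>"))

# With targets: only the supplied candidates are scored. The leading space
# matches RoBERTa's byte-level BPE convention (see the "Ġ..." token_str above).
print(fill_mask("My name is <mask>", targets=[" Patrick", " Clara"]))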
@ -0,0 +1,88 @@
import unittest

from transformers import pipeline
from transformers.pipelines import Pipeline
from transformers.testing_utils import require_tf

from .test_pipelines_common import CustomInputPipelineCommonMixin


VALID_INPUTS = ["A simple string", ["list of strings"]]


class NerPipelineTests(CustomInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "ner"
    small_models = [
        "sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english"
    ]  # Default model - Models tested without the @slow decorator
    large_models = []  # Models tested with the @slow decorator

    def _test_pipeline(self, nlp: Pipeline):
        output_keys = {"entity", "word", "score"}

        ungrouped_ner_inputs = [
            [
                {"entity": "B-PER", "index": 1, "score": 0.9994944930076599, "word": "Cons"},
                {"entity": "B-PER", "index": 2, "score": 0.8025449514389038, "word": "##uelo"},
                {"entity": "I-PER", "index": 3, "score": 0.9993102550506592, "word": "Ara"},
                {"entity": "I-PER", "index": 4, "score": 0.9993743896484375, "word": "##új"},
                {"entity": "I-PER", "index": 5, "score": 0.9992871880531311, "word": "##o"},
                {"entity": "I-PER", "index": 6, "score": 0.9993029236793518, "word": "No"},
                {"entity": "I-PER", "index": 7, "score": 0.9981776475906372, "word": "##guera"},
                {"entity": "B-PER", "index": 15, "score": 0.9998136162757874, "word": "Andrés"},
                {"entity": "I-PER", "index": 16, "score": 0.999740719795227, "word": "Pas"},
                {"entity": "I-PER", "index": 17, "score": 0.9997414350509644, "word": "##tran"},
                {"entity": "I-PER", "index": 18, "score": 0.9996136426925659, "word": "##a"},
                {"entity": "B-ORG", "index": 28, "score": 0.9989739060401917, "word": "Far"},
                {"entity": "I-ORG", "index": 29, "score": 0.7188422083854675, "word": "##c"},
            ],
            [
                {"entity": "I-PER", "index": 1, "score": 0.9968166351318359, "word": "En"},
                {"entity": "I-PER", "index": 2, "score": 0.9957635998725891, "word": "##zo"},
                {"entity": "I-ORG", "index": 7, "score": 0.9986497163772583, "word": "UN"},
            ],
        ]
        expected_grouped_ner_results = [
            [
                {"entity_group": "B-PER", "score": 0.9710702640669686, "word": "Consuelo Araújo Noguera"},
                {"entity_group": "B-PER", "score": 0.9997273534536362, "word": "Andrés Pastrana"},
                {"entity_group": "B-ORG", "score": 0.8589080572128296, "word": "Farc"},
            ],
            [
                {"entity_group": "I-PER", "score": 0.9962901175022125, "word": "Enzo"},
                {"entity_group": "I-ORG", "score": 0.9986497163772583, "word": "UN"},
            ],
        ]

        self.assertIsNotNone(nlp)

        mono_result = nlp(VALID_INPUTS[0])
        self.assertIsInstance(mono_result, list)
        self.assertIsInstance(mono_result[0], (dict, list))

        if isinstance(mono_result[0], list):
            mono_result = mono_result[0]

        for key in output_keys:
            self.assertIn(key, mono_result[0])

        multi_result = [nlp(input) for input in VALID_INPUTS]
        self.assertIsInstance(multi_result, list)
        self.assertIsInstance(multi_result[0], (dict, list))

        if isinstance(multi_result[0], list):
            multi_result = multi_result[0]

        for result in multi_result:
            for key in output_keys:
                self.assertIn(key, result)

        for ungrouped_input, grouped_result in zip(ungrouped_ner_inputs, expected_grouped_ner_results):
            self.assertEqual(nlp.group_entities(ungrouped_input), grouped_result)
    @require_tf
    def test_tf_only(self):
        model_name = "Narsil/small"  # This model only has a TensorFlow version
        # We test that if we don't specify framework='tf', it gets detected automatically
        nlp = pipeline(task="ner", model=model_name, tokenizer=model_name)
        self._test_pipeline(nlp)
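The grouped expectations above are consistent with group_entities stitching WordPiece continuations ("##...") back into words and averaging the token-level scores of each span; a rough illustration of that relationship (not the library implementation, which also handles entity boundaries and indices):

from statistics import mean


def naive_group(tokens):
    # Glue "##" continuations onto the previous word and average the scores.
    words, scores = [], []
    for token in tokens:
        if token["word"].startswith("##"):
            words[-1] += token["word"][2:]
        else:
            words.append(token["word"])
        scores.append(token["score"])
    return {"word": " ".join(words), "score": mean(scores)}


enzo = [
    {"entity": "I-PER", "score": 0.9968166351318359, "word": "En"},
    {"entity": "I-PER", "score": 0.9957635998725891, "word": "##zo"},
]
print(naive_group(enzo))  # ≈ {'word': 'Enzo', 'score': 0.99629...}, matching the expectation above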
@ -0,0 +1,47 @@
import unittest

from transformers.pipelines import Pipeline

from .test_pipelines_common import CustomInputPipelineCommonMixin


class QAPipelineTests(CustomInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "question-answering"
    small_models = [
        "sshleifer/tiny-distilbert-base-cased-distilled-squad"
    ]  # Models tested without the @slow decorator
    large_models = []  # Models tested with the @slow decorator

    def _test_pipeline(self, nlp: Pipeline):
        output_keys = {"score", "answer", "start", "end"}
        valid_inputs = [
            {"question": "Where was HuggingFace founded ?", "context": "HuggingFace was founded in Paris."},
            {
                "question": "In what field is HuggingFace working ?",
                "context": "HuggingFace is a startup based in New-York founded in Paris which is trying to solve NLP.",
            },
        ]
        invalid_inputs = [
            {"question": "", "context": "This is a test to try empty question edge case"},
            {"question": None, "context": "This is a test to try empty question edge case"},
            {"question": "What is does with empty context ?", "context": ""},
            {"question": "What is does with empty context ?", "context": None},
        ]
        self.assertIsNotNone(nlp)

        mono_result = nlp(valid_inputs[0])
        self.assertIsInstance(mono_result, dict)

        for key in output_keys:
            self.assertIn(key, mono_result)

        multi_result = nlp(valid_inputs)
        self.assertIsInstance(multi_result, list)
        self.assertIsInstance(multi_result[0], dict)

        for result in multi_result:
            for key in output_keys:
                self.assertIn(key, result)
        for bad_input in invalid_inputs:
            self.assertRaises(Exception, nlp, bad_input)
        self.assertRaises(Exception, nlp, invalid_inputs)
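The dict-based input format checked above is the same one end users pass; a minimal sketch using the tiny checkpoint from small_models (being a tiny test model, the answer itself is not meaningful):

from transformers import pipeline

qa = pipeline(
    task="question-answering",
    model="sshleifer/tiny-distilbert-base-cased-distilled-squad",
)

# A single example returns one dict carrying the keys asserted above.
result = qa({"question": "Where was HuggingFace founded ?", "context": "HuggingFace was founded in Paris."})
print(result["answer"], result["score"], result["start"], result["end"])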
@ -0,0 +1,12 @@
import unittest

from .test_pipelines_common import MonoInputPipelineCommonMixin


class SentimentAnalysisPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "sentiment-analysis"
    small_models = [
        "sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english"
    ]  # Default model - Models tested without the @slow decorator
    large_models = [None]  # Models tested with the @slow decorator
    mandatory_keys = {"label", "score"}  # Keys which should be in the output
@ -0,0 +1,30 @@
import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch, slow, torch_device

from .test_pipelines_common import MonoInputPipelineCommonMixin


DEFAULT_DEVICE_NUM = -1 if torch_device == "cpu" else 0


class SummarizationPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "summarization"
    pipeline_running_kwargs = {"num_beams": 2, "min_length": 2, "max_length": 5}
    small_models = [
        "patrickvonplaten/t5-tiny-random",
        "sshleifer/bart-tiny-random",
    ]  # Models tested without the @slow decorator
    large_models = []  # Models tested with the @slow decorator
    invalid_inputs = [4, "<mask>"]
    mandatory_keys = ["summary_text"]

    @require_torch
    @slow
    def test_integration_torch_summarization(self):
        nlp = pipeline(task="summarization", device=DEFAULT_DEVICE_NUM)
        cnn_article = ' (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday\'s ceremony, said it was a move toward greater justice. "As Palestine formally becomes a State Party to the Rome Statute today, the world is also a step closer to ending a long era of impunity and injustice," he said, according to an ICC news release. "Indeed, today brings us closer to our shared goals of justice and peace." Judge Kuniko Ozaki, a vice president of the ICC, said acceding to the treaty was just the first step for the Palestinians. "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. These are substantive commitments, which cannot be taken lightly," she said. Rights group Human Rights Watch welcomed the development. "Governments seeking to penalize Palestine for joining the ICC should immediately end their pressure, and countries that support universal acceptance of the court\'s treaty should speak out to welcome its membership," said Balkees Jarrah, international justice counsel for the group. "What\'s objectionable is the attempts to undermine international justice, not Palestine\'s decision to join a treaty to which over 100 countries around the world are members." In January, when the preliminary ICC examination was opened, Israeli Prime Minister Benjamin Netanyahu described it as an outrage, saying the court was overstepping its boundaries. The United States also said it "strongly" disagreed with the court\'s decision. "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement. It urged the warring sides to resolve their differences through direct negotiations. "We will continue to oppose actions against Israel at the ICC as counterproductive to the cause of peace," it said. But the ICC begs to differ with the definition of a state for its purposes and refers to the territories as "Palestine." While a preliminary examination is not a formal investigation, it allows the court to review evidence and determine whether to investigate suspects on both sides. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." The war between Israel and Hamas militants in Gaza last summer left more than 2,000 people dead. The inquiry will include alleged war crimes committed since June. The International Criminal Court was set up in 2002 to prosecute genocide, crimes against humanity and war crimes. CNN\'s Vasco Cotovio, Kareem Khadder and Faith Karimi contributed to this report.'
        expected_cnn_summary = " The Palestinian Authority becomes the 123rd member of the International Criminal Court . The move gives the court jurisdiction over alleged crimes in Palestinian territories . Israel and the United States opposed the Palestinians' efforts to join the court . Rights group Human Rights Watch welcomes the move, says governments seeking to penalize Palestine should end pressure ."
        result = nlp(cnn_article)
        self.assertEqual(result[0]["summary_text"], expected_cnn_summary)
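The pipeline_running_kwargs declared above are plain generation arguments forwarded at call time; a minimal sketch with one of the tiny checkpoints from small_models (being randomly initialised, its summary text is meaningless):

from transformers import pipeline

summarizer = pipeline(task="summarization", model="sshleifer/bart-tiny-random")

outputs = summarizer(
    "The quick brown fox jumps over the lazy dog. " * 10,
    num_beams=2,
    min_length=2,
    max_length=5,
)
print(outputs[0]["summary_text"])  # each result dict carries the mandatory key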
@ -0,0 +1,11 @@
import unittest

from .test_pipelines_common import MonoInputPipelineCommonMixin


class Text2TextGenerationPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "text2text-generation"
    small_models = ["patrickvonplaten/t5-tiny-random"]  # Default model - Models tested without the @slow decorator
    large_models = []  # Models tested with the @slow decorator
    invalid_inputs = [4, "<mask>"]
    mandatory_keys = ["generated_text"]
@ -0,0 +1,10 @@
import unittest

from .test_pipelines_common import MonoInputPipelineCommonMixin


class TextGenerationPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "text-generation"
    pipeline_running_kwargs = {"prefix": "This is "}
    small_models = ["sshleifer/tiny-ctrl"]  # Models tested without the @slow decorator
    large_models = []  # Models tested with the @slow decorator
@ -0,0 +1,54 @@
import unittest

import pytest

from transformers import pipeline
from transformers.testing_utils import is_pipeline_test, require_torch, slow

from .test_pipelines_common import MonoInputPipelineCommonMixin


class TranslationEnToDePipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "translation_en_to_de"
    small_models = ["patrickvonplaten/t5-tiny-random"]  # Default model - Models tested without the @slow decorator
    large_models = [None]  # Models tested with the @slow decorator
    invalid_inputs = [4, "<mask>"]
    mandatory_keys = ["translation_text"]


class TranslationEnToRoPipelineTests(MonoInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "translation_en_to_ro"
    small_models = ["patrickvonplaten/t5-tiny-random"]  # Default model - Models tested without the @slow decorator
    large_models = [None]  # Models tested with the @slow decorator
    invalid_inputs = [4, "<mask>"]
    mandatory_keys = ["translation_text"]


@is_pipeline_test
class TranslationNewFormatPipelineTests(unittest.TestCase):
    @require_torch
    @slow
    def test_default_translations(self):
        # We don't provide a default for this pair
        with self.assertRaises(ValueError):
            pipeline(task="translation_cn_to_ar")

        # but we do for this one
        pipeline(task="translation_en_to_de")

    @require_torch
    def test_translation_on_odd_language(self):
        model = "patrickvonplaten/t5-tiny-random"
        pipeline(task="translation_cn_to_ar", model=model)

    @require_torch
    def test_translation_default_language_selection(self):
        model = "patrickvonplaten/t5-tiny-random"
        with pytest.warns(UserWarning, match=r".*translation_en_to_de.*"):
            nlp = pipeline(task="translation", model=model)
        self.assertEqual(nlp.task, "translation_en_to_de")

    @require_torch
    def test_translation_with_no_language_no_model_fails(self):
        with self.assertRaises(ValueError):
            pipeline(task="translation")
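The four tests above pin down how the new translation task names are resolved; a standalone sketch of the dispatch rules they imply (the set of pairs with default models is an assumption, and this is not the code in pipelines.py):

import re
import warnings

PAIRS_WITH_DEFAULT_MODELS = {"translation_en_to_de", "translation_en_to_fr", "translation_en_to_ro"}


def resolve_translation_task(task, model=None):
    if task == "translation":
        if model is None:
            raise ValueError("A bare 'translation' task needs an explicit model.")
        # With a model but no language pair, fall back to a default pair and warn.
        warnings.warn("No language pair given, defaulting to 'translation_en_to_de'.", UserWarning)
        return "translation_en_to_de"
    if re.fullmatch(r"translation_\w+_to_\w+", task):
        if model is None and task not in PAIRS_WITH_DEFAULT_MODELS:
            raise ValueError(f"No default model for '{task}', please provide one.")
        return task
    raise ValueError(f"Unknown task '{task}'")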
@ -0,0 +1,120 @@
import unittest

from transformers.pipelines import Pipeline

from .test_pipelines_common import CustomInputPipelineCommonMixin


class ZeroShotClassificationPipelineTests(CustomInputPipelineCommonMixin, unittest.TestCase):
    pipeline_task = "zero-shot-classification"
    small_models = [
        "sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english"
    ]  # Models tested without the @slow decorator
    large_models = ["roberta-large-mnli"]  # Models tested with the @slow decorator

    def _test_scores_sum_to_one(self, result):
        sum = 0.0
        for score in result["scores"]:
            sum += score
        self.assertAlmostEqual(sum, 1.0)

    def _test_pipeline(self, nlp: Pipeline):
        output_keys = {"sequence", "labels", "scores"}
        valid_mono_inputs = [
            {"sequences": "Who are you voting for in 2020?", "candidate_labels": "politics"},
            {"sequences": "Who are you voting for in 2020?", "candidate_labels": ["politics"]},
            {"sequences": "Who are you voting for in 2020?", "candidate_labels": "politics, public health"},
            {"sequences": "Who are you voting for in 2020?", "candidate_labels": ["politics", "public health"]},
            {"sequences": ["Who are you voting for in 2020?"], "candidate_labels": "politics"},
            {
                "sequences": "Who are you voting for in 2020?",
                "candidate_labels": "politics",
                "hypothesis_template": "This text is about {}",
            },
        ]
        valid_multi_input = {
            "sequences": ["Who are you voting for in 2020?", "What is the capital of Spain?"],
            "candidate_labels": "politics",
        }
        invalid_inputs = [
            {"sequences": None, "candidate_labels": "politics"},
            {"sequences": "", "candidate_labels": "politics"},
            {"sequences": "Who are you voting for in 2020?", "candidate_labels": None},
            {"sequences": "Who are you voting for in 2020?", "candidate_labels": ""},
            {
                "sequences": "Who are you voting for in 2020?",
                "candidate_labels": "politics",
                "hypothesis_template": None,
            },
            {
                "sequences": "Who are you voting for in 2020?",
                "candidate_labels": "politics",
                "hypothesis_template": "",
            },
            {
                "sequences": "Who are you voting for in 2020?",
                "candidate_labels": "politics",
                "hypothesis_template": "Template without formatting syntax.",
            },
        ]
        self.assertIsNotNone(nlp)

        for mono_input in valid_mono_inputs:
            mono_result = nlp(**mono_input)
            self.assertIsInstance(mono_result, dict)
            if len(mono_result["labels"]) > 1:
                self._test_scores_sum_to_one(mono_result)

            for key in output_keys:
                self.assertIn(key, mono_result)

        multi_result = nlp(**valid_multi_input)
        self.assertIsInstance(multi_result, list)
        self.assertIsInstance(multi_result[0], dict)
        self.assertEqual(len(multi_result), len(valid_multi_input["sequences"]))

        for result in multi_result:
            for key in output_keys:
                self.assertIn(key, result)

            if len(result["labels"]) > 1:
                self._test_scores_sum_to_one(result)

        for bad_input in invalid_inputs:
            self.assertRaises(Exception, nlp, **bad_input)

        if nlp.model.name_or_path in self.large_models:
            # We also check the outputs for the large models
            inputs = [
                {
                    "sequences": "Who are you voting for in 2020?",
                    "candidate_labels": ["politics", "public health", "science"],
                },
                {
                    "sequences": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
                    "candidate_labels": ["machine learning", "statistics", "translation", "vision"],
                    "multi_class": True,
                },
            ]

            expected_outputs = [
                {
                    "sequence": "Who are you voting for in 2020?",
                    "labels": ["politics", "public health", "science"],
                    "scores": [0.975, 0.015, 0.008],
                },
                {
                    "sequence": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
                    "labels": ["translation", "machine learning", "vision", "statistics"],
                    "scores": [0.817, 0.712, 0.018, 0.017],
                },
            ]

            for input, expected_output in zip(inputs, expected_outputs):
                output = nlp(**input)
                for key in output:
                    if key == "scores":
                        for output_score, expected_score in zip(output[key], expected_output[key]):
                            self.assertAlmostEqual(output_score, expected_score, places=2)
                    else:
                        self.assertEqual(output[key], expected_output[key])
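Note the contrast encoded in the large-model expectations: single-label scores are checked to sum to one, while the multi_class scores (0.817 + 0.712 + ...) deliberately do not, consistent with each label being scored independently in that mode. A minimal usage sketch under that assumption (exact scores will vary):

from transformers import pipeline

classifier = pipeline(task="zero-shot-classification", model="roberta-large-mnli")

single = classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "public health", "science"],
)
print(sum(single["scores"]))  # ~1.0: the candidate labels compete with each other

multi = classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "public health", "science"],
    multi_class=True,
)
print(sum(multi["scores"]))  # generally not 1.0: labels are scored independently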
@ -25,7 +25,14 @@ from itertools import takewhile
from typing import TYPE_CHECKING, Dict, List, Tuple, Union

from transformers import PreTrainedTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizerFast, is_torch_available
from transformers.testing_utils import get_tests_dir, require_tf, require_tokenizers, require_torch, slow
from transformers.testing_utils import (
    get_tests_dir,
    is_pt_tf_cross_test,
    require_tf,
    require_tokenizers,
    require_torch,
    slow,
)
from transformers.tokenization_utils import AddedToken
@ -1517,8 +1524,7 @@ class TokenizerTesterMixin:
                string_sequences, return_overflowing_tokens=True, truncation=True, padding=True, max_length=3
            )

    @require_torch
    @require_tf
    @is_pt_tf_cross_test
    def test_batch_encode_plus_tensors(self):
        tokenizers = self.get_tokenizers(do_lower_case=False)
        for tokenizer in tokenizers:
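The decorator swap above replaces the two framework requirements with a single cross-test marker. A rough sketch of what such a marker typically does in testing_utils (the environment switch name and the exact skip mechanics are assumptions, not the real implementation):

import importlib.util
import os
import unittest


def is_pt_tf_cross_test(test_case):
    # Sketch: run only when the cross-framework suite is explicitly enabled,
    # and only when both PyTorch and TensorFlow are importable.
    if os.environ.get("RUN_PT_TF_CROSS_TESTS", "0") != "1":
        return unittest.skip("test is a PT+TF cross test")(test_case)
    if importlib.util.find_spec("torch") is None or importlib.util.find_spec("tensorflow") is None:
        return unittest.skip("test requires both PyTorch and TensorFlow")(test_case)
    return test_case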