# Processors
Processors can mean two different things in the Transformers library:
- the objects that pre-process inputs for multi-modal models such as Wav2Vec2 (speech and text) or CLIP (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQuAD.
## Multi-modal processors
Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text, vision and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).
Those processors inherit from the following base class that implements the saving and loading functionality:
[[autodoc]] ProcessorMixin
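As a quick illustration, here is a minimal sketch of a multi-modal processor in use; the CLIP checkpoint name and the blank placeholder image are only examples:

```python
from PIL import Image
from transformers import AutoProcessor

# Placeholder inputs for the sketch: a blank image and a short caption.
image = Image.new("RGB", (224, 224))
text = ["a photo of a cat"]

# The processor bundles CLIP's tokenizer (text) and image processor (vision).
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One call prepares both modalities; the result can be fed directly to the model.
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)

# Saving and reloading the whole bundle is provided by ProcessorMixin.
processor.save_pretrained("./clip-processor")
```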
## Deprecated processors
All processors follow the same architecture, which is that of the [`~data.processors.utils.DataProcessor`]. The processor returns a list of [`~data.processors.utils.InputExample`]. These [`~data.processors.utils.InputExample`] can be converted to [`~data.processors.utils.InputFeatures`] in order to be fed to the model.
[[autodoc]] data.processors.utils.DataProcessor

[[autodoc]] data.processors.utils.InputExample

[[autodoc]] data.processors.utils.InputFeatures
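To make this flow concrete, here is a minimal sketch of a hypothetical custom processor; the class name and file layout are invented for the example, while the method and field names (`get_labels`, `get_train_examples`, `guid`, `text_a`, `text_b`, `label`) follow the base classes documented above:

```python
from transformers.data.processors.utils import DataProcessor, InputExample


class PairClassificationProcessor(DataProcessor):
    """Hypothetical processor for a tab-separated sentence-pair dataset."""

    def get_labels(self):
        return ["0", "1"]

    def get_train_examples(self, data_dir):
        # _read_tsv is a helper inherited from DataProcessor.
        return self._create_examples(self._read_tsv(f"{data_dir}/train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(self._read_tsv(f"{data_dir}/dev.tsv"), "dev")

    def _create_examples(self, rows, set_type):
        # Wrap each row in an InputExample, which can later be converted to InputFeatures.
        return [
            InputExample(guid=f"{set_type}-{i}", text_a=row[0], text_b=row[1], label=row[2])
            for i, row in enumerate(rows)
        ]
```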
### GLUE
General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding.
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
Those processors are:
- [`~data.processors.utils.MrpcProcessor`]
- [`~data.processors.utils.MnliProcessor`]
- [`~data.processors.utils.MnliMismatchedProcessor`]
- [`~data.processors.utils.ColaProcessor`]
- [`~data.processors.utils.Sst2Processor`]
- [`~data.processors.utils.StsbProcessor`]
- [`~data.processors.utils.QqpProcessor`]
- [`~data.processors.utils.QnliProcessor`]
- [`~data.processors.utils.RteProcessor`]
- [`~data.processors.utils.WnliProcessor`]
Additionally, the following method can be used to load values from a data file and convert them to a list of [`~data.processors.utils.InputExample`].
[[autodoc]] data.processors.glue.glue_convert_examples_to_features
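As an illustration, here is a short sketch combining one of these processors with the conversion method; `glue_data_dir` is a placeholder assumed to point at an unpacked MRPC dataset (with `train.tsv`/`dev.tsv` files), and the checkpoint name is only an example:

```python
from transformers import AutoTokenizer
from transformers.data.processors.glue import MrpcProcessor, glue_convert_examples_to_features

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = MrpcProcessor()

# glue_data_dir is assumed to contain the MRPC train.tsv/dev.tsv files.
examples = processor.get_dev_examples(glue_data_dir)

# Convert the InputExample objects into InputFeatures the model can consume.
features = glue_convert_examples_to_features(
    examples,
    tokenizer,
    max_length=128,
    task="mrpc",
)
```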
### XNLI
The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on MultiNLI: pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).
It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations.
This library hosts the processor to load the XNLI data:
- [`~data.processors.utils.XnliProcessor`]
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the `run_xnli.py` script.
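As a brief sketch (the data directory is a placeholder, and the constructor arguments reflect this processor being parameterized by language, so treat them as an assumption):

```python
from transformers.data.processors.xnli import XnliProcessor

# xnli_data_dir is assumed to contain the extracted XNLI data files.
# The processor is built for one evaluation language (here Swahili).
processor = XnliProcessor(language="sw")

# Gold labels are available on the test set, so evaluation uses the test examples.
examples = processor.get_test_examples(xnli_data_dir)
labels = processor.get_labels()  # ["contradiction", "entailment", "neutral"]
```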
### SQuAD
The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Don't Know: Unanswerable Questions for SQuAD.
This library hosts a processor for each of the two versions:
#### Processors
Those processors are:
- [`~data.processors.utils.SquadV1Processor`]
- [`~data.processors.utils.SquadV2Processor`]
They both inherit from the abstract class [`~data.processors.utils.SquadProcessor`].
[[autodoc]] data.processors.squad.SquadProcessor
    - all
Additionally, the following method can be used to convert SQuAD examples into [`~data.processors.utils.SquadFeatures`] that can be used as model inputs.
[[autodoc]] data.processors.squad.squad_convert_examples_to_features
These processors, as well as the aforementioned method, can be used either with files containing the data or with the tensorflow_datasets package. Examples are given below.
#### Example usage
Here is an example using the processors as well as the conversion method with data files:

```python
from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```
Using tensorflow_datasets is as easy as using a data file:
```python
import tensorflow_datasets as tfds

# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```
Another example using these processors is given in the `run_squad.py` script.