Use text_column_name variable instead of "text" (#12132)

* Use text_column_name variable instead of "text"

`text_column_name` was already defined above where I made the changes and it was also used below where I made changes.

This is a very minor change. If a dataset does not use "text" as the column name, then the `tokenize_function` will now use whatever column is assigned to `text_column_name`. `text_column_name` is just the first column name if "text" is not a column name. It makes the function a little more robust, though I would assume that 90% + of datasets use "text" anyway.

* black formatting

* make style

Co-authored-by: Nicholas Broad <nicholas@nmbroad.com>
This commit is contained in:
Nicholas Broad 2021-06-14 08:11:13 -04:00 committed by GitHub
parent b8ab541340
commit cd7961b632
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 8 additions and 4 deletions

View File

@ -345,9 +345,11 @@ def main():
def tokenize_function(examples):
# Remove empty lines
examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
examples[text_column_name] = [
line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
]
return tokenizer(
examples["text"],
examples[text_column_name],
padding=padding,
truncation=True,
max_length=max_seq_length,

View File

@ -327,9 +327,11 @@ def main():
def tokenize_function(examples):
# Remove empty lines
examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
examples[text_column_name] = [
line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
]
return tokenizer(
examples["text"],
examples[text_column_name],
padding=padding,
truncation=True,
max_length=max_seq_length,