diff --git a/README.md b/README.md index f5b6c4e..b8586f8 100644 --- a/README.md +++ b/README.md @@ -289,6 +289,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. +1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. @@ -324,6 +325,8 @@ You can refine your search by selecting the task you're interested in (e.g., [te 1. 
**[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. @@ -331,9 +334,11 @@ You can refine your search by selecting the task you're interested in (e.g., [te 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. 1. 
**[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham. 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. 
diff --git a/docs/snippets/6_supported-models.snippet b/docs/snippets/6_supported-models.snippet index 7c66a5e..9f9e69c 100644 --- a/docs/snippets/6_supported-models.snippet +++ b/docs/snippets/6_supported-models.snippet @@ -24,6 +24,7 @@ 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. +1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. @@ -59,6 +60,8 @@ 1. 
**[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. 1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. 1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. @@ -66,9 +69,11 @@ 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. 1. 
**[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham. 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. 
diff --git a/scripts/convert.py b/scripts/convert.py index b48e1ed..631f69b 100644 --- a/scripts/convert.py +++ b/scripts/convert.py @@ -334,7 +334,15 @@ def main(): with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp: json.dump(tokenizer_json, fp, indent=4) + + elif config.model_type == 'vits': + if tokenizer is not None: + from .extra.vits import generate_tokenizer_json + tokenizer_json = generate_tokenizer_json(tokenizer) + with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp: + json.dump(tokenizer_json, fp, indent=4) + elif config.model_type == 'speecht5': # TODO allow user to specify vocoder path export_kwargs["model_kwargs"] = {"vocoder": "microsoft/speecht5_hifigan"}
diff --git a/scripts/extra/vits.py b/scripts/extra/vits.py new file mode 100644 index 0000000..d4b75f2 --- /dev/null +++ b/scripts/extra/vits.py @@ -0,0 +1,100 @@ + + +def generate_tokenizer_json(tokenizer): + vocab = tokenizer.get_vocab() + + normalizers = [] + + if tokenizer.normalize: + # Lowercase the input string + normalizers.append({ + "type": "Lowercase", + }) + + if tokenizer.language == 'ron': + # Replace diacritics + normalizers.append({ + "type": "Replace", + "pattern": { + "String": "ț", + }, + "content": "ţ", + }) + + if tokenizer.phonemize: + raise NotImplementedError("Phonemization is not implemented yet") + + elif tokenizer.normalize: + # strip any chars outside of the vocab (punctuation) + chars = ''.join(x for x in vocab if len(x) == 1) + escaped = chars.replace('-', r'\-').replace(']', r'\]') + normalizers.append({ + "type": "Replace", + "pattern": { + "Regex": f"[^{escaped}]", + }, + "content": "", + }) + normalizers.append({ + "type": "Strip", + "strip_left": True, + "strip_right": True, + }) + + if tokenizer.add_blank: + # add pad token between each char + normalizers.append({ + "type": "Replace", + "pattern": { + # Add a blank token between each char, except when blank (then do nothing) + "Regex": "(?=.)|(?<=.)", + }, + "content": tokenizer.pad_token, + })
diff --git a/src/models.js b/src/models.js --- a/src/models.js +++ b/src/models.js +////////////////////////////////////////////////// +// RoFormer models +export class RoFormerPreTrainedModel extends PreTrainedModel { } + +/** + * The bare RoFormer Model transformer outputting raw hidden-states without any specific head on top. + */ +export class RoFormerModel extends RoFormerPreTrainedModel { } + +/** + * RoFormer Model with a `language modeling` head on top. + */ +export class RoFormerForMaskedLM extends RoFormerPreTrainedModel { + /** + * Calls the model on new inputs. + * + * @param {Object} model_inputs The inputs to the model. + * @returns {Promise<MaskedLMOutput>} An object containing the model's output logits for masked language modeling. + */ + async _call(model_inputs) { + return new MaskedLMOutput(await super._call(model_inputs)); + } +} + +/** + * RoFormer Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) + */ +export class RoFormerForSequenceClassification extends RoFormerPreTrainedModel { + /** + * Calls the model on new inputs. + * + * @param {Object} model_inputs The inputs to the model. + * @returns {Promise<SequenceClassifierOutput>} An object containing the model's output logits for sequence classification. + */ + async _call(model_inputs) { + return new SequenceClassifierOutput(await super._call(model_inputs)); + } +} + +/** + * RoFormer Model with a token classification head on top (a linear layer on top of the hidden-states output) + * e.g. for Named-Entity-Recognition (NER) tasks. + */ +export class RoFormerForTokenClassification extends RoFormerPreTrainedModel { + /** + * Calls the model on new inputs. + * + * @param {Object} model_inputs The inputs to the model. + * @returns {Promise<TokenClassifierOutput>} An object containing the model's output logits for token classification. + */ + async _call(model_inputs) { + return new TokenClassifierOutput(await super._call(model_inputs)); + } +} + +/** + * RoFormer Model with a span classification head on top for extractive question-answering tasks like SQuAD + * (linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
+ */ +export class RoFormerForQuestionAnswering extends RoFormerPreTrainedModel { + /** + * Calls the model on new inputs. + * + * @param {Object} model_inputs The inputs to the model. + * @returns {Promise<QuestionAnsweringModelOutput>} An object containing the model's output logits for question answering. + */ + async _call(model_inputs) { + return new QuestionAnsweringModelOutput(await super._call(model_inputs)); + } +} +// TODO: Add RoFormerForCausalLM and RoFormerForMultipleChoice +////////////////////////////////////////////////// ////////////////////////////////////////////////// // ConvBert models @@ -3725,6 +3797,30 @@ export class DetrSegmentationOutput extends ModelOutput { } ////////////////////////////////////////////////// +////////////////////////////////////////////////// +export class TableTransformerPreTrainedModel extends PreTrainedModel { } + +/** + * The bare Table Transformer Model (consisting of a backbone and encoder-decoder Transformer) + * outputting raw hidden-states without any specific head on top. + */ +export class TableTransformerModel extends TableTransformerPreTrainedModel { } + +/** + * Table Transformer Model (consisting of a backbone and encoder-decoder Transformer) + * with object detection heads on top, for tasks such as COCO detection. + */ +export class TableTransformerForObjectDetection extends TableTransformerPreTrainedModel { + /** + * @param {any} model_inputs + */ + async _call(model_inputs) { + return new TableTransformerObjectDetectionOutput(await super._call(model_inputs)); + } +} +export class TableTransformerObjectDetectionOutput extends DetrObjectDetectionOutput { } +////////////////////////////////////////////////// + ////////////////////////////////////////////////// export class DeiTPreTrainedModel extends PreTrainedModel { } @@ -4719,6 +4815,68 @@ export class ClapAudioModelWithProjection extends ClapPreTrainedModel { ////////////////////////////////////////////////// +////////////////////////////////////////////////// +// VITS models +export class VitsPreTrainedModel extends PreTrainedModel { } + +/** + * The complete VITS model, for text-to-speech synthesis. + * + * **Example:** Generate speech from text with `VitsModel`. + * ```javascript + * import { AutoTokenizer, VitsModel } from '@xenova/transformers'; + * + * // Load the tokenizer and model + * const tokenizer = await AutoTokenizer.from_pretrained('Xenova/mms-tts-eng'); + * const model = await VitsModel.from_pretrained('Xenova/mms-tts-eng'); + * + * // Run tokenization + * const inputs = tokenizer('I love transformers'); + * + * // Generate waveform + * const { waveform } = await model(inputs); + * // Tensor { + * // dims: [ 1, 35328 ], + * // type: 'float32', + * // data: Float32Array(35328) [ ... ], + * // size: 35328, + * // } + * ``` + */ +export class VitsModel extends VitsPreTrainedModel { + /** + * Calls the model on new inputs. + * @param {Object} model_inputs The inputs to the model. + * @returns {Promise<VitsModelOutput>} The outputs for the VITS model. + */ + async _call(model_inputs) { + return new VitsModelOutput(await super._call(model_inputs)); + } +} +////////////////////////////////////////////////// + +////////////////////////////////////////////////// +// Segformer models +export class SegformerPreTrainedModel extends PreTrainedModel { } + +/** + * The bare SegFormer encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top.
+ */ +export class SegformerModel extends SegformerPreTrainedModel { } + +/** + * SegFormer Model transformer with an image classification head on top (a linear layer on top of the final hidden states) e.g. for ImageNet. + */ +export class SegformerForImageClassification extends SegformerPreTrainedModel { } + +/** + * SegFormer Model transformer with an all-MLP decode head on top e.g. for ADE20k, CityScapes. + */ +export class SegformerForSemanticSegmentation extends SegformerPreTrainedModel { } + +////////////////////////////////////////////////// + + ////////////////////////////////////////////////// // AutoModels, used to simplify construction of PreTrainedModels // (uses config to instantiate correct class) @@ -4790,6 +4948,7 @@ export class PretrainedMixin { const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([ ['bert', ['BertModel', BertModel]], + ['roformer', ['RoFormerModel', RoFormerModel]], ['electra', ['ElectraModel', ElectraModel]], ['esm', ['EsmModel', EsmModel]], ['convbert', ['ConvBertModel', ConvBertModel]], @@ -4812,8 +4971,10 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([ ['hubert', ['HubertModel', HubertModel]], ['wavlm', ['WavLMModel', WavLMModel]], ['audio-spectrogram-transformer', ['ASTModel', ASTModel]], + ['vits', ['VitsModel', VitsModel]], ['detr', ['DetrModel', DetrModel]], + ['table-transformer', ['TableTransformerModel', TableTransformerModel]], ['vit', ['ViTModel', ViTModel]], ['mobilevit', ['MobileViTModel', MobileViTModel]], ['owlvit', ['OwlViTModel', OwlViTModel]], @@ -4868,14 +5029,19 @@ const MODEL_MAPPING_NAMES_DECODER_ONLY = new Map([ const MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = new Map([ ['speecht5', ['SpeechT5ForSpeechToText', SpeechT5ForSpeechToText]], ['whisper', ['WhisperForConditionalGeneration', WhisperForConditionalGeneration]], -]) +]); const MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES = new Map([ ['speecht5', ['SpeechT5ForTextToSpeech', SpeechT5ForTextToSpeech]], -]) +]); + +const MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = new Map([ + ['vits', ['VitsModel', VitsModel]], +]); const MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = new Map([ ['bert', ['BertForSequenceClassification', BertForSequenceClassification]], + ['roformer', ['RoFormerForSequenceClassification', RoFormerForSequenceClassification]], ['electra', ['ElectraForSequenceClassification', ElectraForSequenceClassification]], ['esm', ['EsmForSequenceClassification', EsmForSequenceClassification]], ['convbert', ['ConvBertForSequenceClassification', ConvBertForSequenceClassification]], @@ -4896,6 +5062,7 @@ const MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = new Map([ const MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = new Map([ ['bert', ['BertForTokenClassification', BertForTokenClassification]], + ['roformer', ['RoFormerForTokenClassification', RoFormerForTokenClassification]], ['electra', ['ElectraForTokenClassification', ElectraForTokenClassification]], ['esm', ['EsmForTokenClassification', EsmForTokenClassification]], ['convbert', ['ConvBertForTokenClassification', ConvBertForTokenClassification]], @@ -4941,6 +5108,7 @@ const MODEL_WITH_LM_HEAD_MAPPING_NAMES = new Map([ const MODEL_FOR_MASKED_LM_MAPPING_NAMES = new Map([ ['bert', ['BertForMaskedLM', BertForMaskedLM]], + ['roformer', ['RoFormerForMaskedLM', RoFormerForMaskedLM]], ['electra', ['ElectraForMaskedLM', ElectraForMaskedLM]], ['esm', ['EsmForMaskedLM', EsmForMaskedLM]], ['convbert', ['ConvBertForMaskedLM', ConvBertForMaskedLM]], @@ -4959,6 +5127,7 @@ const MODEL_FOR_MASKED_LM_MAPPING_NAMES = new Map([ const 
MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([ ['bert', ['BertForQuestionAnswering', BertForQuestionAnswering]], + ['roformer', ['RoFormerForQuestionAnswering', RoFormerForQuestionAnswering]], ['electra', ['ElectraForQuestionAnswering', ElectraForQuestionAnswering]], ['convbert', ['ConvBertForQuestionAnswering', ConvBertForQuestionAnswering]], ['camembert', ['CamembertForQuestionAnswering', CamembertForQuestionAnswering]], @@ -4992,10 +5161,12 @@ const MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = new Map([ ['dinov2', ['Dinov2ForImageClassification', Dinov2ForImageClassification]], ['resnet', ['ResNetForImageClassification', ResNetForImageClassification]], ['swin', ['SwinForImageClassification', SwinForImageClassification]], + ['segformer', ['SegformerForImageClassification', SegformerForImageClassification]], ]); const MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = new Map([ ['detr', ['DetrForObjectDetection', DetrForObjectDetection]], + ['table-transformer', ['TableTransformerForObjectDetection', TableTransformerForObjectDetection]], ['yolos', ['YolosForObjectDetection', YolosForObjectDetection]], ]); @@ -5007,6 +5178,10 @@ const MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES = new Map([ ['detr', ['DetrForSegmentation', DetrForSegmentation]], ]); +const MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES = new Map([ + ['segformer', ['SegformerForSemanticSegmentation', SegformerForSemanticSegmentation]], +]); + const MODEL_FOR_MASK_GENERATION_MAPPING_NAMES = new Map([ ['sam', ['SamModel', SamModel]], ]); @@ -5052,6 +5227,7 @@ const MODEL_CLASS_TYPE_MAPPING = [ [MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES, MODEL_TYPES.Vision2Seq], [MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], + [MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], @@ -5061,6 +5237,7 @@ const MODEL_CLASS_TYPE_MAPPING = [ [MODEL_FOR_CTC_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES, MODEL_TYPES.Seq2Seq], + [MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], ]; for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) { @@ -5154,6 +5331,17 @@ export class AutoModelForTextToSpectrogram extends PretrainedMixin { static MODEL_CLASS_MAPPINGS = [MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES]; } +/** + * Helper class which is used to instantiate pretrained text-to-waveform models with the `from_pretrained` function. + * The chosen model class is determined by the type specified in the model config. + * + * @example + * let model = await AutoModelForTextToWaveform.from_pretrained('facebook/mms-tts-eng'); + */ +export class AutoModelForTextToWaveform extends PretrainedMixin { + static MODEL_CLASS_MAPPINGS = [MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES]; +} + /** * Helper class which is used to instantiate pretrained causal language models with the `from_pretrained` function. * The chosen model class is determined by the type specified in the model config.
@@ -5220,6 +5408,17 @@ export class AutoModelForImageSegmentation extends PretrainedMixin { static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES]; } +/** + * Helper class which is used to instantiate pretrained semantic segmentation models with the `from_pretrained` function. + * The chosen model class is determined by the type specified in the model config. + * + * @example + * let model = await AutoModelForSemanticSegmentation.from_pretrained('nvidia/segformer-b3-finetuned-cityscapes-1024-1024'); + */ +export class AutoModelForSemanticSegmentation extends PretrainedMixin { + static MODEL_CLASS_MAPPINGS = [MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES]; +} + /** * Helper class which is used to instantiate pretrained object detection models with the `from_pretrained` function. * The chosen model class is determined by the type specified in the model config. @@ -5393,3 +5592,20 @@ export class ImageMattingOutput extends ModelOutput { this.alphas = alphas; } } + +/** + * Describes the outputs for the VITS model. + */ +export class VitsModelOutput extends ModelOutput { + /** + * @param {Object} output The output of the model. + * @param {Tensor} output.waveform The final audio waveform predicted by the model, of shape `(batch_size, sequence_length)`. + * @param {Tensor} output.spectrogram The log-mel spectrogram predicted at the output of the flow model. + * This spectrogram is passed to the Hi-Fi GAN decoder model to obtain the final audio waveform. + */ + constructor({ waveform, spectrogram }) { + super(); + this.waveform = waveform; + this.spectrogram = spectrogram; + } +}
diff --git a/src/pipelines.js b/src/pipelines.js index fab848b..6ed132f 100644 --- a/src/pipelines.js +++ b/src/pipelines.js @@ -26,18 +26,19 @@ import { AutoModelForMaskedLM, AutoModelForSeq2SeqLM, AutoModelForSpeechSeq2Seq, + AutoModelForTextToWaveform, AutoModelForTextToSpectrogram, AutoModelForCTC, AutoModelForCausalLM, AutoModelForVision2Seq, AutoModelForImageClassification, AutoModelForImageSegmentation, + AutoModelForSemanticSegmentation, AutoModelForObjectDetection, AutoModelForZeroShotObjectDetection, AutoModelForDocumentQuestionAnswering, AutoModelForImageToImage, AutoModelForDepthEstimation, - // AutoModelForTextToWaveform, PreTrainedModel, } from './models.js'; import { @@ -280,7 +281,7 @@ export class TokenClassificationPipeline extends Pipeline { let tokenData = batch[j]; let topScoreIndex = max(tokenData.data)[1]; - let entity = id2label[topScoreIndex]; + let entity = id2label ? id2label[topScoreIndex] : `LABEL_${topScoreIndex}`; if (ignore_labels.includes(entity)) { // We predicted a token that should be ignored. So, we skip it. continue; } @@ -1710,8 +1711,26 @@ export class ImageSegmentationPipeline extends Pipeline { } } else if (subtask === 'semantic') { - throw Error(`semantic segmentation not yet supported.`); + const { segmentation, labels } = fn(output, target_sizes ??
imageSizes)[0]; + const id2label = this.model.config.id2label; + + for (let label of labels) { + const maskData = new Uint8ClampedArray(segmentation.data.length); + for (let i = 0; i < segmentation.data.length; ++i) { + if (segmentation.data[i] === label) { + maskData[i] = 255; + } + } + + const mask = new RawImage(maskData, segmentation.dims[1], segmentation.dims[0], 1); + + annotation.push({ + score: null, + label: id2label[label], + mask: mask + }); + } } else { throw Error(`Subtask ${subtask} not supported.`); } @@ -2117,6 +2136,16 @@ export class DocumentQuestionAnsweringPipeline extends Pipeline { * wav.fromScratch(1, out.sampling_rate, '32f', out.audio); * fs.writeFileSync('out.wav', wav.toBuffer()); * ``` + * + * **Example:** Multilingual speech generation with `Xenova/mms-tts-fra`. See [here](https://huggingface.co/models?pipeline_tag=text-to-speech&other=vits&sort=trending) for the full list of available languages (1107). + * ```js + * let synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-fra'); + * let out = await synthesizer('Bonjour'); + * // { + * // audio: Float32Array(23808) [-0.00037693005288019776, 0.0003325853613205254, ...], + * // sampling_rate: 16000 + * // } + * ``` */ export class TextToAudioPipeline extends Pipeline { DEFAULT_VOCODER_ID = "Xenova/speecht5_hifigan" @@ -2148,6 +2177,34 @@ async _call(text_inputs, { speaker_embeddings = null, } = {}) { + // If this.processor is not set, we are using an `AutoModelForTextToWaveform` model + if (this.processor) { + return this._call_text_to_spectrogram(text_inputs, { speaker_embeddings }); + } else { + return this._call_text_to_waveform(text_inputs); + } + } + + async _call_text_to_waveform(text_inputs) { + + // Run tokenization + const inputs = this.tokenizer(text_inputs, { + padding: true, + truncation: true + }); + + // Generate waveform + const { waveform } = await this.model(inputs); + + const sampling_rate = this.model.config.sampling_rate; + return { + audio: waveform.data, + sampling_rate, + } + } + + async _call_text_to_spectrogram(text_inputs, { speaker_embeddings }) { + + // Load vocoder, if not provided if (!this.vocoder) { console.log('No vocoder specified, using default HifiGan vocoder.'); @@ -2417,8 +2474,8 @@ const SUPPORTED_TASKS = { "text-to-audio": { "tokenizer": AutoTokenizer, "pipeline": TextToAudioPipeline, - "model": [ /* TODO: AutoModelForTextToWaveform, */ AutoModelForTextToSpectrogram], - "processor": AutoProcessor, + "model": [AutoModelForTextToWaveform, AutoModelForTextToSpectrogram], + "processor": [AutoProcessor, /* Some don't use a processor */ null], "default": { // TODO: replace with original // "model": "microsoft/speecht5_tts", @@ -2455,7 +2512,7 @@ "image-segmentation": { // no tokenizer "pipeline": ImageSegmentationPipeline, - "model": AutoModelForImageSegmentation, + "model": [AutoModelForImageSegmentation, AutoModelForSemanticSegmentation], "processor": AutoProcessor, "default": { // TODO: replace with original @@ -2678,6 +2735,12 @@ async function loadItems(mapping, model, pretrainedOptions) { promise = new Promise(async (resolve, reject) => { let e; for (let c of cls) { + if (c === null) { + // If null, we resolve it immediately, meaning the relevant + // class was not found, but it is optional.
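+ // (e.g. the `null` processor entry used by text-to-waveform models such as VITS)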
+ resolve(null); + return; + } try { resolve(await c.from_pretrained(model, pretrainedOptions)); return;
diff --git a/src/processors.js b/src/processors.js index 30e56c8..4d31359 100644 --- a/src/processors.js +++ b/src/processors.js @@ -618,6 +618,71 @@ export class ImageFeatureExtractor extends FeatureExtractor { } +export class SegformerFeatureExtractor extends ImageFeatureExtractor { + + /** + * Converts the output of `SegformerForSemanticSegmentation` into semantic segmentation maps. + * @param {*} outputs Raw outputs of the model. + * @param {number[][]} [target_sizes=null] List of tuples corresponding to the requested final size + * (height, width) of each prediction. If unset, predictions will not be resized. + * @returns {{segmentation: Tensor; labels: number[]}[]} The semantic segmentation maps. + */ + post_process_semantic_segmentation(outputs, target_sizes = null) { + + const logits = outputs.logits; + const batch_size = logits.dims[0]; + + if (target_sizes !== null && target_sizes.length !== batch_size) { + throw Error("Make sure that you pass in as many target sizes as the batch dimension of the logits") + } + + const toReturn = []; + for (let i = 0; i < batch_size; ++i) { + const target_size = target_sizes !== null ? target_sizes[i] : null; + + let data = logits[i]; + + // 1. If target_size is not null, we need to resize the masks to the target size + if (target_size !== null) { + // resize the masks to the target size + data = interpolate(data, target_size, 'bilinear', false); + } + const [height, width] = target_size ?? data.dims.slice(-2); + + const segmentation = new Tensor( + 'int32', + new Int32Array(height * width), + [height, width] + ); + + // Buffer to store current largest value + const buffer = data[0].data; + for (let j = 1; j < data.dims[0]; ++j) { + const row = data[j].data; + for (let k = 0; k < row.length; ++k) { + if (row[k] > buffer[k]) { + buffer[k] = row[k]; + segmentation.data[k] = j; + } + } + } + + // Store which objects have labels + // This is much more efficient than creating a set of the final values + const hasLabel = new Array(data.dims[0]); + const out = segmentation.data; + for (let j = 0; j < out.length; ++j) { + const index = out[j]; + hasLabel[index] = index; + } + /** @type {number[]} The unique list of labels that were detected */ + const labels = hasLabel.filter(x => x !== undefined); + + toReturn.push({ segmentation, labels }); + } + return toReturn; + } +} export class BitImageProcessor extends ImageFeatureExtractor { } export class DPTFeatureExtractor extends ImageFeatureExtractor { } export class GLPNFeatureExtractor extends ImageFeatureExtractor { } @@ -1710,6 +1775,7 @@ export class AutoProcessor { SiglipImageProcessor, ConvNextFeatureExtractor, ConvNextImageProcessor, + SegformerFeatureExtractor, BitImageProcessor, DPTFeatureExtractor, GLPNFeatureExtractor,
diff --git a/src/tokenizers.js b/src/tokenizers.js index 8732241..94f5def 100644 --- a/src/tokenizers.js +++ b/src/tokenizers.js @@ -1182,17 +1182,61 @@ class BertNormalizer extends Normalizer { return text.normalize('NFD').replace(/[\u0300-\u036f]/g, ''); } + + /** + * Checks whether `char` is a control character. + * @param {string} char The character to check. + * @returns {boolean} Whether `char` is a control character. + * @private + */ + _is_control(char) { + switch (char) { + case '\t': + case '\n': + case '\r': + // These are technically control characters but we count them as whitespace characters.
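+ // (mirroring the reference Python implementation, which treats \t, \n and \r as whitespace)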
+ return false; + + default: + // Check if unicode category starts with C: + // Cc - Control + // Cf - Format + // Co - Private Use + // Cs - Surrogate + return /^[\p{Cc}\p{Cf}\p{Co}\p{Cs}]$/u.test(char); + } + } + + /** + * Performs invalid character removal and whitespace cleanup on text. + * @param {string} text The text to clean. + * @returns {string} The cleaned text. + * @private + */ + _clean_text(text) { + const output = []; + for (const char of text) { + const cp = char.charCodeAt(0); + if (cp === 0 || cp === 0xFFFD || this._is_control(char)) { + continue; + } + if (/^\s$/.test(char)) { // is whitespace + output.push(" "); + } else { + output.push(char); + } + } + return output.join(""); + } /** * Normalizes the given text based on the configuration. * @param {string} text The text to normalize. * @returns {string} The normalized text. */ normalize(text) { - // TODO use rest of config - // config.clean_text, - // config.handle_chinese_chars, - // config.strip_accents, - // config.lowercase, + if (this.config.clean_text) { + text = this._clean_text(text); + } if (this.config.handle_chinese_chars) { text = this._tokenize_chinese_chars(text); @@ -2036,6 +2080,18 @@ class BPEDecoder extends Decoder { } } +// Custom decoder for VITS +class VitsDecoder extends Decoder { + /** @type {Decoder['decode_chain']} */ + decode_chain(tokens) { + let decoded = ''; + for (let i = 1; i < tokens.length; i += 2) { + decoded += tokens[i]; + } + return [decoded]; + } +} + /** * This PreTokenizer replaces spaces with the given replacement character, adds a prefix space if requested, @@ -2946,6 +3002,12 @@ export class ConvBertTokenizer extends PreTrainedTokenizer { return add_token_types(inputs); } } +export class RoFormerTokenizer extends PreTrainedTokenizer { + /** @type {add_token_types} */ + prepare_model_inputs(inputs) { + return add_token_types(inputs); + } +} export class DistilBertTokenizer extends PreTrainedTokenizer { } export class CamembertTokenizer extends PreTrainedTokenizer { } export class XLMTokenizer extends PreTrainedTokenizer { @@ -4121,6 +4183,15 @@ export class SpeechT5Tokenizer extends PreTrainedTokenizer { } export class NougatTokenizer extends PreTrainedTokenizer { } +export class VitsTokenizer extends PreTrainedTokenizer { + + constructor(tokenizerJSON, tokenizerConfig) { + super(tokenizerJSON, tokenizerConfig); + + // Custom decoder function + this.decoder = new VitsDecoder({}); + } +} /** * Helper class which is used to instantiate pretrained tokenizers with the `from_pretrained` function. * The chosen tokenizer class is determined by the type specified in the tokenizer config. @@ -4138,6 +4209,7 @@ export class AutoTokenizer { BertTokenizer, HerbertTokenizer, ConvBertTokenizer, + RoFormerTokenizer, XLMTokenizer, ElectraTokenizer, MobileBertTokenizer, @@ -4168,6 +4240,7 @@ BlenderbotSmallTokenizer, SpeechT5Tokenizer, NougatTokenizer, + VitsTokenizer, // Base case: PreTrainedTokenizer,
diff --git a/tests/generate_tests.py b/tests/generate_tests.py index d4c4cc7..e047802 100644 --- a/tests/generate_tests.py +++ b/tests/generate_tests.py @@ -42,6 +42,14 @@ MODELS_TO_IGNORE = [ # TODO: remove when https://github.com/huggingface/transformers/pull/26522 is merged 'siglip', + + # TODO: remove when https://github.com/huggingface/transformers/issues/28164 is fixed + 'roformer', + + # TODO: remove when https://github.com/huggingface/transformers/issues/28173 is fixed. Issues include: + # - decoding with `skip_special_tokens=True`.
+ # - interspersing the pad token is broken. + 'vits', ] TOKENIZERS_TO_IGNORE = [ @@ -83,6 +91,9 @@ TOKENIZER_TEST_DATA = { "\n", " test ", "test", + + # Control characters + "1\u00002\uFFFD3", ], "custom_by_model_type": { "llama": [ @@ -115,7 +126,13 @@ TOKENIZER_TEST_DATA = { "The Heavenly Llama is said to drink water from the ocean and urinates as it rains.[6] According to " \ "Aymara eschatology, llamas will return to the water springs and lagoons where they come from at the " \ "end of time.[6]", - ] + ], + + "vits": [ + "abcdefghijklmnopqrstuvwxyz01234567890", + # Special treatment of characters in certain languages + "ț ţ", + ], }, "custom": { "facebook/blenderbot_small-90M": [
diff --git a/tests/init.js b/tests/init.js index b01fe10..cda487e 100644 --- a/tests/init.js +++ b/tests/init.js @@ -26,10 +26,12 @@ export function init() { "Int8Array", "Int16Array", "Int32Array", + "BigInt64Array", "Uint8Array", "Uint8ClampedArray", "Uint16Array", "Uint32Array", + "BigUint64Array", "Float32Array", "Float64Array", ];
diff --git a/tests/pipelines.test.js b/tests/pipelines.test.js index 5ccbb6e..9c9900c 100644 --- a/tests/pipelines.test.js +++ b/tests/pipelines.test.js @@ -909,6 +909,48 @@ describe('Pipelines', () => { }, MAX_TEST_EXECUTION_TIME); }); + describe('Text-to-speech generation', () => { + + // List all models which will be tested + const models = [ + 'microsoft/speecht5_tts', + 'facebook/mms-tts-fra', + ]; + + it(models[0], async () => { + let synthesizer = await pipeline('text-to-speech', m(models[0]), { + // NOTE: Although the quantized version produces incoherent results, + // it is okay to use for testing. + // quantized: false, + }); + + let speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin'; + + { // Generate English speech + let output = await synthesizer('Hello, my dog is cute', { speaker_embeddings }); + expect(output.audio.length).toBeGreaterThan(0); + expect(output.sampling_rate).toEqual(16000); + } + + await synthesizer.dispose(); + + }, MAX_TEST_EXECUTION_TIME); + + it(models[1], async () => { + let synthesizer = await pipeline('text-to-speech', m(models[1])); + + { // Generate French speech + let output = await synthesizer('Bonjour'); + expect(output.audio.length).toBeGreaterThan(0); + expect(output.sampling_rate).toEqual(16000); + } + + await synthesizer.dispose(); + + }, MAX_TEST_EXECUTION_TIME); + + }); + describe('Audio classification', () => { // List all models which will be tested @@ -1122,6 +1164,7 @@ describe('Pipelines', () => { // List all models which will be tested const models = [ 'facebook/detr-resnet-50-panoptic', + 'mattmdjaga/segformer_b2_clothes', ]; it(models[0], async () => { @@ -1153,6 +1196,47 @@ await segmenter.dispose(); }, MAX_TEST_EXECUTION_TIME); + + it(models[1], async () => { + let segmenter = await pipeline('image-segmentation', m(models[1])); + let img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg'; + + // single + { + let outputs = await segmenter(img); + + let expected = [ + { label: 'Background' }, + { label: 'Hair' }, + { label: 'Upper-clothes' }, + { label: 'Pants' }, + { label: 'Left-shoe' }, + { label: 'Right-shoe' }, + { label: 'Face' }, + { label: 'Left-leg' }, + { label: 'Right-leg' }, + { label: 'Left-arm' }, + { label: 'Right-arm' }, + ]; + + let outputLabels = outputs.map(x => x.label); + let expectedLabels = expected.map(x => x.label); +
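// Segment order is not guaranteed, so compare the sorted label lists +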
expect(outputLabels).toHaveLength(expectedLabels.length); + expect(outputLabels.sort()).toEqual(expectedLabels.sort()); + + // check that all scores are null, and masks have correct dimensions + for (let output of outputs) { + expect(output.score).toBeNull(); + expect(output.mask.width).toEqual(970); + expect(output.mask.height).toEqual(1455); + expect(output.mask.channels).toEqual(1); + } + } + + await segmenter.dispose(); + + }, MAX_TEST_EXECUTION_TIME); }); describe('Zero-shot image classification', () => {
diff --git a/tests/requirements.txt b/tests/requirements.txt index df750ed..5fdb282 100644 --- a/tests/requirements.txt +++ b/tests/requirements.txt @@ -2,3 +2,4 @@ transformers[torch]@git+https://github.com/huggingface/transformers sacremoses==0.0.53 sentencepiece==0.1.99 protobuf==4.24.3 +rjieba==0.1.11
diff --git a/tests/tokenizers.test.js b/tests/tokenizers.test.js index bc2bf12..6bafa23 100644 --- a/tests/tokenizers.test.js +++ b/tests/tokenizers.test.js @@ -29,6 +29,9 @@ describe('Tokenizers (dynamic)', () => { expect(encoded).toEqual(test.encoded); + // Skip decoding tests if encoding produces zero tokens + if (test.encoded.input_ids.length === 0) continue; + // Test decoding let decoded_with_special = tokenizer.decode(encoded.input_ids, { skip_special_tokens: false }); expect(decoded_with_special).toEqual(test.decoded_with_special);
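As a usage sketch, the semantic segmentation support added above can be exercised end-to-end like this (checkpoint and test image are the ones used in the test suite; the exact labels depend on the model):

```js
import { pipeline } from '@xenova/transformers';

// 'image-segmentation' now also tries AutoModelForSemanticSegmentation,
// so SegFormer checkpoints load directly.
const segmenter = await pipeline('image-segmentation', 'mattmdjaga/segformer_b2_clothes');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
const outputs = await segmenter(url);

// The semantic subtask yields one { score: null, label, mask } entry per detected label,
// where `mask` is a single-channel RawImage with the dimensions of the input image.
for (const { label, mask } of outputs) {
    console.log(label, mask.width, mask.height);
}
```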