LocalAI - Models

arcee-ai_afm-4.5b

AFM-4.5B is a 4.5 billion parameter instruction-tuned model developed by Arcee.ai, designed for enterprise-grade performance across diverse deployment environments from cloud to edge. The base model was trained on a dataset of 8 trillion tokens, comprising 6.5 trillion tokens of general pretraining data followed by 1.5 trillion tokens of midtraining data with enhanced focus on mathematical reasoning and code generation. Following pretraining, the model underwent supervised fine-tuning on high-quality instruction datasets. The instruction-tuned model was further refined through reinforcement learning on verifiable rewards as well as on human preference. We use a modified version of TorchTitan for pretraining, Axolotl for supervised fine-tuning, and a modified version of Verifiers for reinforcement learning.

The development of AFM-4.5B prioritized data quality as a fundamental requirement for achieving robust model performance. We collaborated with DatologyAI, a company specializing in large-scale data curation. DatologyAI's curation pipeline integrates a suite of proprietary algorithms: model-based quality filtering, embedding-based curation, target distribution-matching, source mixing, and synthetic data. Their expertise enabled the creation of a curated dataset tailored to support strong real-world performance.

The model architecture follows a standard transformer decoder-only design based on Vaswani et al., incorporating several key modifications for enhanced performance and efficiency. Notable architectural features include grouped query attention for improved inference efficiency and ReLU^2 activation functions instead of SwiGLU to enable sparsification while maintaining or exceeding performance benchmarks. The model available in this repo is the instruct model following supervised fine-tuning and reinforcement learning.
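
As a rough illustration of the ReLU^2 activation mentioned above, the sketch below implements squared ReLU in plain Python and measures how many outputs are exactly zero. The function names and sample inputs are made up for illustration and are not taken from Arcee's code. The point is only that every non-positive input maps to an exact zero, which is the property sparsification can exploit.

```python
def relu_squared(x: float) -> float:
    """Squared ReLU: max(0, x) ** 2."""
    return max(0.0, x) ** 2


def zero_fraction(values: list[float]) -> float:
    """Fraction of activations that are exactly zero after squared ReLU."""
    outputs = [relu_squared(v) for v in values]
    return sum(1 for o in outputs if o == 0.0) / len(outputs)


# Hypothetical pre-activation values, chosen only for illustration.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 3.0]
activations = [relu_squared(v) for v in pre_activations]
```

A smooth gated activation such as SwiGLU generally yields small non-zero values where squared ReLU yields exact zeros, which is why the two behave differently under sparsification.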

Links

Tags

qwen3-embedding-4b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

**Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios.

**Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

**Multilingual Capability**: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.

**Qwen3-Embedding-4B-GGUF** has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 4B
- Context Length: 32k
- Embedding Dimension: Up to 2560, supports user-defined output dimensions ranging from 32 to 2560
- Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16
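
The user-defined output dimension listed above can be pictured with a small sketch. It assumes the shorter vector is obtained by keeping the leading components of the full embedding and re-normalizing to unit length (the usual Matryoshka-style approach). The description does not state the mechanism, so treat this as an assumption. The function name and the `min_dim` parameter are hypothetical.

```python
import math


def truncate_embedding(vector: list[float], dim: int, min_dim: int = 32) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Assumption: leading components carry the most information, so a
    prefix of the full vector is itself a usable embedding.
    """
    if not min_dim <= dim <= len(vector):
        raise ValueError(f"dim must be between {min_dim} and {len(vector)}")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Stand-in for a 2560-dimensional model output (values are arbitrary).
full = [float(i % 7) for i in range(2560)]
small = truncate_embedding(full, 64)
```

Smaller vectors reduce index size and search cost, usually at some cost in retrieval quality, so the chosen dimension is worth validating on your own data.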

Links

https://huggingface.co/Qwen/Qwen3-Embedding-4B-GGUF

Tags

qwen3-embedding-8b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

**Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios.

**Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

**Multilingual Capability**: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.

**Qwen3-Embedding-8B-GGUF** has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 8B
- Context Length: 32k
- Embedding Dimension: Up to 4096, supports user-defined output dimensions ranging from 32 to 4096
- Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Embedding-8B-GGUF

Tags

qwen3-embedding-0.6b

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

**Exceptional Versatility**: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks **No.1** in the MTEB multilingual leaderboard (as of June 5, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios.

**Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

**Multilingual Capability**: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.

**Qwen3-Embedding-0.6B-GGUF** has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 0.6B
- Context Length: 32k
- Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024
- Quantization: q8_0, f16

Links

https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF

Tags

google-gemma-3-27b-it-qat-q4_0-small

This is a requantized version of https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf. The official QAT weights released by Google use fp16 (instead of Q6_K) for the embeddings table, which makes the official model take significantly more memory (and storage) than Q4_0 quants are supposed to take. Requantizing with llama.cpp achieves a very similar result. Note that this model ends up smaller than the Q4_0 from Bartowski. This is because llama.cpp sets some tensors to Q4_1 when quantizing models to Q4_0 with imatrix, but this is a static quant. The perplexity score of this model is even lower than that of the original model by Google, but the results are within the margin of error, so it's probably just luck. I also fixed the control token metadata, which was slightly degrading the performance of the model in instruct mode.
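
To get a feel for why the storage type of the embeddings table matters, the arithmetic can be sketched as below. The bit widths are 16 bits per weight for fp16 and 6.5625 bits per weight for Q6_K (llama.cpp packs 256 weights into a 210-byte block). The vocabulary size and hidden size are placeholder values assumed for illustration, not figures taken from the description above. Check the model's own configuration before relying on them.

```python
FP16_BITS = 16.0
Q6_K_BITS = 210 * 8 / 256  # 210-byte block holding 256 weights = 6.5625 bits each


def table_bytes(rows: int, cols: int, bits_per_weight: float) -> float:
    """Storage needed for a rows x cols weight table at the given bit width."""
    return rows * cols * bits_per_weight / 8


# Assumed, illustrative shape for the token embeddings table.
VOCAB_SIZE = 262_144   # hypothetical
HIDDEN_SIZE = 5_376    # hypothetical

fp16_size = table_bytes(VOCAB_SIZE, HIDDEN_SIZE, FP16_BITS)
q6k_size = table_bytes(VOCAB_SIZE, HIDDEN_SIZE, Q6_K_BITS)
saved_gib = (fp16_size - q6k_size) / 2**30

print(f"fp16: {fp16_size / 2**30:.2f} GiB, Q6_K: {q6k_size / 2**30:.2f} GiB, "
      f"saved: {saved_gib:.2f} GiB")
```

With these assumed dimensions the table shrinks by well over half, which is consistent with the "significant extra" memory the description attributes to the fp16 embeddings.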

Links

Tags

rwkv-6-world-7b

RWKV (pronounced RwaKuv) is an RNN with GPT-level LLM performance that can also be trained directly like a GPT transformer (parallelizable). We are at RWKV-7. It combines the best of RNN and transformer: great performance, fast inference, fast training, lower VRAM use, "infinite" ctxlen, and free text embedding. Moreover, it is 100% attention-free and a Linux Foundation AI project.

Links

Tags

granite-embedding-107m-multilingual

Granite-Embedding-107M-Multilingual is a 107M parameter dense biencoder embedding model from the Granite Embeddings suite that can be used to generate high-quality text embeddings. This model produces embedding vectors of size 384 and is trained using a combination of open-source relevance-pair datasets with permissive, enterprise-friendly licenses, and IBM-collected and generated datasets. This model is developed using contrastive finetuning, knowledge distillation, and model merging for improved performance.

Links

Tags

granite-embedding-125m-english

Granite-Embedding-125m-English is a 125M parameter dense biencoder embedding model from the Granite Embeddings suite that can be used to generate high-quality text embeddings. This model produces embedding vectors of size 768. Unlike most other open-source models, this model was trained only on open-source relevance-pair datasets with permissive, enterprise-friendly licenses, plus IBM-collected and generated datasets. While maintaining competitive scores on academic benchmarks such as BEIR, this model also performs well on many enterprise use cases. This model is developed using retrieval-oriented pretraining, contrastive finetuning, and knowledge distillation.

Links

Tags

ultravox-v0_5-llama-3_1-8b

Ultravox is a multimodal Speech LLM built around a pretrained Llama3.1-8B-Instruct and whisper-large-v3-turbo backbone. See https://ultravox.ai for the GitHub repo and more information. Ultravox is a multimodal model that can consume both speech and text as input (e.g., a text system prompt and voice user message). The input to the model is given as a text prompt with a special <|audio|> pseudo-token, and the model processor will replace this magic token with embeddings derived from the input audio. Using the merged embeddings as input, the model will then generate output text as usual. In a future revision of Ultravox, we plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which can then be fed to a vocoder to produce voice output. No preference tuning has been applied to this revision of the model.
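
The placeholder substitution described above can be sketched in a few lines. This is a toy model of the idea, not Ultravox's processor code: `merge_embeddings`, `toy_text_embed`, and the two-dimensional vectors are all hypothetical, and real audio embeddings would come from the audio encoder rather than a hard-coded list.

```python
from typing import Callable

AUDIO_TOKEN = "<|audio|>"
Vector = list[float]


def merge_embeddings(
    tokens: list[str],
    text_embed: Callable[[str], Vector],
    audio_frames: list[Vector],
) -> list[Vector]:
    """Embed text tokens and splice the audio embeddings in place of the
    audio pseudo-token, so one placeholder expands to many vectors."""
    merged: list[Vector] = []
    for token in tokens:
        if token == AUDIO_TOKEN:
            merged.extend(audio_frames)
        else:
            merged.append(text_embed(token))
    return merged


def toy_text_embed(token: str) -> Vector:
    """Stand-in for a real embedding lookup (illustration only)."""
    return [float(len(token)), 0.0]


prompt_tokens = ["Transcribe", ":", AUDIO_TOKEN]
audio = [[0.1, 0.2], [0.3, 0.4]]  # made-up audio-derived embeddings
sequence = merge_embeddings(prompt_tokens, toy_text_embed, audio)
```

The merged sequence is longer than the token list because the single placeholder is replaced by one vector per audio frame. The language model then consumes it like any other embedded prompt.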

Links

Tags

gemma-2-9b-it-abliterated

Abliterated version of google/gemma-2-9b-it. The abliteration script is based on code from the blog post and heavily uses TransformerLens. The only major difference from the code used for Llama is scaling the embedding layer back. Orthogonalization did not produce the same results as regular interventions since there are RMSNorm layers before merging activations into the residual stream. However, the final model still seems to be uncensored.

Links

Tags

openvino-multilingual-e5-base

Links

https://huggingface.co/intfloat/multilingual-e5-base

Tags

openvino-all-MiniLM-L6-v2

Links

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Tags

all-MiniLM-L6-v2

This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks such as BERT, RoBERTa, and XLM-RoBERTa, and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar texts lie close together and can be found efficiently using cosine similarity.
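
The cosine-similarity lookup mentioned above can be sketched without any library. The vectors below are tiny hand-written stand-ins for real sentence embeddings, and `most_similar` is a hypothetical helper, not part of the sentence-transformers API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar(query: list[float], corpus: dict[str, list[float]]) -> str:
    """Return the key of the corpus vector closest to the query."""
    return max(corpus, key=lambda key: cosine_similarity(query, corpus[key]))


# Made-up 3-dimensional "embeddings", for illustration only.
corpus = {
    "cats": [0.9, 0.1, 0.0],
    "finance": [0.0, 0.2, 0.9],
}
best = most_similar([1.0, 0.0, 0.1], corpus)
```

Because cosine similarity ignores vector length, only the direction of each embedding matters. That is also why embeddings are often normalized to unit length before indexing.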

Links

https://github.com/UKPLab/sentence-transformers

Tags

bert-embeddings

Llama 3.2 embeddings model, used as a drop-in replacement for bert-embeddings.

Links

https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF

Tags

nomic-embed-text-v1.5

Resizable Production Embeddings with Matryoshka Representation Learning

Links

Tags
