Build Embeddings and Semantic Search with Sentence Transformers
Keyword search matches letters; semantic search matches meaning. The trick is to turn each piece of text into a vector (embedding) so that similar meanings land close together — then a query finds its nearest neighbours. This tutorial builds a working semantic search engine with Sentence Transformers and FAISS in a few lines of Python.
Prerequisites
- Python 3.10+ (no GPU needed for this example)
1. Install
python -m pip install -U sentence-transformers faiss-cpu
2. Embed a corpus
all-MiniLM-L6-v2 is small, fast, and produces 384-dimensional vectors — perfect for getting started.
from sentence_transformers import SentenceTransformer
docs = [
"How do I reset my password?",
"The invoice total looks incorrect.",
"Where can I download my receipt?",
"My account is locked after too many attempts.",
"How do I change my billing address?",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True) # shape (5, 384)
print(embeddings.shape) # (5, 384)
Normalizing the vectors means a plain dot-product equals cosine similarity — convenient for search.
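To see why normalization makes the dot product and cosine similarity interchangeable, here is a small NumPy check. It uses random vectors as stand-ins for real embeddings (an assumption made so the snippet runs without downloading a model); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # stand-in for one embedding
b = rng.normal(size=384)  # stand-in for another

# Cosine similarity from its definition: dot product divided by both lengths.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Scale each vector to unit length (L2 normalization).
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# On unit vectors the plain dot product gives the same number.
dot_of_units = a_unit @ b_unit

print(np.isclose(cosine, dot_of_units))  # True
```

Because the division by the lengths has already been done once at encode time, the search step only needs dot products.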
3. Build a FAISS index
import faiss
import numpy as np
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim) # inner product == cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))
print(index.ntotal) # 5
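A flat inner-product index does an exhaustive search: it scores the query against every stored vector and keeps the best k. The NumPy sketch below mirrors that behaviour on toy data. It is not how FAISS is implemented internally, and `flat_ip_search` is a hypothetical helper name used only for illustration.

```python
import numpy as np

def flat_ip_search(vectors, query, k):
    """Exhaustive inner-product search: score every vector, keep the top k."""
    scores = vectors @ query        # one dot product per stored vector
    top = np.argsort(-scores)[:k]   # indices of the k highest scores
    return scores[top], top

# Toy unit vectors standing in for real embeddings (illustrative data only).
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
query = np.array([1.0, 0.0], dtype="float32")

scores, ids = flat_ip_search(vectors, query, k=2)
print(ids)  # [0 2]
```

The cost grows linearly with the number of stored vectors, which is fine for a handful of documents and is the reason approximate indexes exist for larger corpora.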
4. Search
def search(query, k=3):
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, k)
    return [(docs[i], round(float(s), 3)) for s, i in zip(scores[0], ids[0])]

for hit in search("I forgot my login credentials"):
    print(hit)
Expected output (exact scores may differ slightly across library and model versions):
('My account is locked after too many attempts.', 0.62)
('How do I reset my password?', 0.61)
('Where can I download my receipt?', 0.27)
Notice the top hits never share the words “forgot” or “login” — they match on meaning. That’s the whole point of embeddings, and it’s exactly what powers retrieval-augmented generation (RAG).
5. Scale up
- Swap IndexFlatIP for IndexIVFFlat or IndexHNSWFlat when you have millions of vectors.
- Persist with faiss.write_index(index, "corpus.faiss") and reload with faiss.read_index(...).
- For a managed vector database, the same vectors drop straight into Milvus, Pinecone, or pgvector.
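The idea behind an inverted-file index such as IndexIVFFlat is to group vectors into cells and score only the cells nearest the query. The toy NumPy sketch below illustrates that idea under simplifying assumptions: the data is random, the centroids are picked at random rather than learned with k-means, and `ivf_search`, `n_cells`, and `n_probe` are hypothetical names. It is not the FAISS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 1,000 random unit vectors in 16 dimensions.
vectors = rng.normal(size=(1000, 16)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Stand-in for training: pick 10 stored vectors as cell centroids.
# (A real index would learn centroids; random picks keep the sketch short.)
n_cells = 10
centroids = vectors[rng.choice(len(vectors), size=n_cells, replace=False)]

# Assign every vector to the cell whose centroid it is most similar to.
cell_of = np.argmax(vectors @ centroids.T, axis=1)

def ivf_search(query, k=3, n_probe=2):
    """Score only the vectors in the n_probe cells closest to the query."""
    nearest_cells = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.flatnonzero(np.isin(cell_of, nearest_cells))
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order], len(candidates)

query = vectors[0]
ids, scores, n_scanned = ivf_search(query)
print(ids[0], n_scanned)  # best hit, and how many vectors were actually scored
```

Probing more cells raises recall and cost together; probing every cell reduces to the exhaustive search of a flat index.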
Common errors
- faiss import fails: install faiss-cpu (or faiss-gpu on CUDA machines).
- Poor results: make sure you encode the query with the same model, and normalize both sides.
- Slow first run: the model downloads once, then is cached locally.
FAQ
Which embedding model should I use?
all-MiniLM-L6-v2 for speed; all-mpnet-base-v2 for higher quality; intfloat/multilingual-e5-base for many languages (E5 models expect a "query: " or "passage: " prefix on each input text).
Is this the same as RAG?
Retrieval is the first half of RAG: you retrieve relevant chunks by embedding similarity, then feed them to an LLM to answer.
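The second half of RAG can be as simple as pasting the retrieved chunks into a prompt. The sketch below shows one way to do that. `build_prompt` and the prompt wording are hypothetical, the chunks are hard-coded so the snippet runs on its own, and the call to an actual LLM is left out.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved text with the user's question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# In the tutorial these chunks would come from search(); they are
# hard-coded here so the sketch is self-contained.
chunks = [
    "My account is locked after too many attempts.",
    "How do I reset my password?",
]
prompt = build_prompt("I forgot my login credentials", chunks)
print(prompt)
```

The resulting string is what you would send to whichever LLM you use.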