# RAG for knime2py — Design & Usage
This document describes the Retrieval-Augmented Generation (RAG) layer implemented for the project. It covers the architecture, utilities, scripts, configuration, prompt/retrieval strategy, token budgeting, and concrete usage examples.
## 1) High-level Overview
- **Goal.** Answer repository questions and safely rewrite single files using an LLM, grounding the LLM on the codebase via a local ChromaDB index.
- **Backends.**
  - Embeddings: OpenAI (e.g., `text-embedding-3-large`) or local SBERT (e.g., `sentence-transformers/all-MiniLM-L6-v2`).
  - Generation: OpenAI chat models (`gpt-4o*`, etc.) or Ollama (local models like `llama3`).
- **Index.** A persistent Chroma index under `./.rag_index/` with:
  - a chunk collection for content passages, and
  - a manifest collection mapping basenames (e.g., `registry.py`) to full repo paths, used for filename-hint retrieval.
- **Repository structure file.** A small `rag/.generated/STRUCTURE.md` is assumed to exist and is used to give the LLM top-level orientation; a small number of its chunks are always injected (reserved slots).
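For orientation, the manifest collection can be pictured as a basename-to-paths lookup. A minimal sketch of filename-hint resolution follows; the `MANIFEST` data and helper names are illustrative, not the actual `rag_utils` API:

```python
import re

# Toy manifest: basename -> full repo paths (hypothetical sample data).
MANIFEST = {
    "registry.py": ["src/knime2py/registry.py"],
    "rag_utils.py": ["rag/rag_utils.py"],
}

def extract_basenames(prompt: str) -> list[str]:
    """Pull *.py basenames mentioned anywhere in a prompt."""
    return re.findall(r"[\w\-]+\.py", prompt)

def resolve_hints(prompt: str) -> list[str]:
    """Map mentioned basenames to full repo paths via the manifest."""
    paths: list[str] = []
    for name in extract_basenames(prompt):
        paths.extend(MANIFEST.get(name, []))
    return paths

print(resolve_hints("How is registry.py initialized?"))
# -> ['src/knime2py/registry.py']
```

In the real system the mapping lives in the manifest Chroma collection rather than a dict, but the lookup shape is the same.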
## 2) Implemented Components
### 2.1 `rag/rag_utils.py` (shared utilities)

- **Config & naming:** `RAGConfig`, `load_config_from_env(...)`; collection naming via `current_collection_name(...)` and `manifest_collection_name(...)`.
- **Chroma access:** `get_client(...)`, `get_collection(...)`, `get_manifest(...)`.
- **Embeddings:** `encode_query(...)` (OpenAI or SBERT).
- **Retrieval primitives:** `retrieve_raw(...)`, `fetch_file_chunks_by_path(...)`; filename hints via `extract_file_hints(...)` and `find_paths_by_basenames(...)`; structure slice via `structure_chunks(...)`.
- **Composite retrieval:** `retrieve_with_structure_and_hints(...)` — order without rerank:
  - filename-hinted chunks (per-file cap),
  - `STRUCTURE.md` chunks (reserved small slice),
  - vector search fill;
  - optional rerank via cross-encoder (`RAG_RERANK=1`).
- **Prompt utilities:** `format_chunks(...)` (for QA), `build_qa_prompt(system_prompt, question, passages, extra_instructions="")`, `QA_SYSTEM_PROMPT` and `system_prompt(...)` (overrides via `RAG_SYS_PROMPT` or `RAG_SYS_PROMPT_FILE`).
- **Token budgeting:** `count_tokens(...)`, `ensure_prompt_fits(...)`, `resolve_context_window(...)`.
- **Edit-mode helpers:** `extract_between_markers(...)` for strict `<<BEGIN_FILE>> … <<END_FILE>>` handling, `lang_for(...)` (for informative context blocks).
- **Banner:** `print_mode_banner(...)` — unified banner for the OpenAI and Ollama frontends.

Why it matters: this consolidation removes duplication and keeps the front-end scripts thin.
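Backend-aware collection naming matters because chunks embedded with different models are not comparable, so each backend/model pair needs its own collection. One plausible naming scheme is sketched below; the real `current_collection_name(...)` may use a different format:

```python
import re

def slug(model: str) -> str:
    """Lowercase a model id and replace non-alphanumeric runs with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", model.lower()).strip("-")

def current_collection_name(base: str, backend: str, model: str) -> str:
    """Encode backend and model into the collection name (illustrative scheme)."""
    return f"{base}__{backend}__{slug(model)}"

def manifest_collection_name(base: str, backend: str, model: str) -> str:
    """Companion manifest collection for basename -> path lookups."""
    return current_collection_name(base, backend, model) + "__manifest"

print(current_collection_name("code_chunks", "openai", "text-embedding-3-large"))
# -> code_chunks__openai__text-embedding-3-large
```

A scheme like this is also why a backend/model mismatch makes retrieval return nothing: the scripts look up a collection name that was never indexed (see section 9).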
### 2.2 `rag/query_openai.py` (Q&A with OpenAI)

- Uses shared utilities for config, retrieval, prompt formatting, token guard, and banner.
- Prompt: `QA_SYSTEM_PROMPT` with instructions to cite chunks and stick to repository context.
- CLI: `--top-k` to control context breadth, `--show-sources` to print retrieved chunk references, `--model` to switch models at runtime.
- Rerank: optional (`RAG_RERANK=1`) via cross-encoder (`RAG_RERANK_MODEL`), with `RAG_RERANK_K` as the over-fetch size.
### 2.3 `rag/query_ollama.py` (Q&A with Ollama)

- Parity with `query_openai.py`, but generation is sent to Ollama.
- Context window and max output use Ollama-family heuristics.
- Same retrieval strategy (structure slice + hints + vector fill).
- Uses the shared banner, prompt formatting, and token guard.
### 2.4 `rag/query_openai_file.py` (single-file editor with OpenAI)

- **Purpose:** rewrite a single file and emit only the fully updated file.
- **Strict I/O contract:**
  - The model must return the complete file wrapped between markers:

    ```
    <<BEGIN_FILE>>
    # ...entire updated file...
    <<END_FILE>>
    ```

  - The script extracts only the payload between the markers.
- **Context construction:**
  - a small `STRUCTURE.md` slice (reserved slots),
  - filename-hinted + vector-retrieved chunks excluding the target file (prevents parroting),
  - the entire current file injected as the “source of truth.”
- **Dynamic token budgeting:**
  - computes `computed_max_output = min(requested_max_output, ctx_window - input_tokens - safety)`,
  - prints the computed max tokens before sending the request,
  - applies `ensure_prompt_fits(...)` with the computed value.
- **Safety & determinism:** `temperature=0.0`, strict markers, and clear constraints.
- **`--rewrite`:** if supplied, writes the updated content back to the original file path (assumes version control is guarding against irrevocable loss).
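The strict marker contract boils down to a small extraction helper. A sketch of the idea behind `extract_between_markers(...)` follows; the actual implementation in `rag_utils` may differ in details such as error types:

```python
BEGIN, END = "<<BEGIN_FILE>>", "<<END_FILE>>"

def extract_between_markers(text: str) -> str:
    """Return the payload between the markers, failing fast on violations."""
    start = text.find(BEGIN)
    stop = text.rfind(END)
    if start == -1 or stop == -1 or stop < start:
        raise ValueError("model reply violated the <<BEGIN_FILE>>/<<END_FILE>> contract")
    # Keep everything between the markers; trim only the framing newlines.
    return text[start + len(BEGIN):stop].strip("\n")

reply = "<<BEGIN_FILE>>\nprint('hello')\n<<END_FILE>>"
print(extract_between_markers(reply))  # -> print('hello')
```

Failing fast here is what makes `--rewrite` safe to automate: a malformed reply raises instead of writing a truncated file.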
## 3) Retrieval Strategy

- **Filename hints (priority).** If the prompt mentions files like `registry.py`, those basenames are resolved through the manifest collection into full paths. A small, capped number of chunks per hinted file is injected first.
- **Structure slice (reserved).** A small number of chunks from `rag/.generated/STRUCTURE.md` are injected to give the model a global layout view. This prevents the editor/QA from hallucinating directories or missing module counts.
- **Vector search fill.** The remainder of the context is filled with standard vector search results from the main code chunk collection. Optional re-ranking can be applied across the union (hints + structure + raw).
- **Exclusions (edit mode).** When editing, the target file’s chunks are excluded from retrieval so the model cannot simply echo the original content. The canonical source is the explicit “Target file (current contents)” section.
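The fill order above can be sketched as a simple budget loop. This is a sketch with assumed chunk shapes, not the actual `retrieve_with_structure_and_hints(...)` code, and it simplifies the per-file hint cap to a single total cap:

```python
def assemble_context(hint_chunks, structure_chunks, vector_chunks,
                     top_k=6, structure_max=1, hint_cap=8):
    """Fill a top_k budget: hinted chunks first, then the reserved
    STRUCTURE.md slice, then vector-search results, deduplicating by id."""
    picked, seen = [], set()

    def take(chunks, limit):
        for chunk in chunks:
            if len(picked) >= top_k or limit <= 0:
                break
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                picked.append(chunk)
                limit -= 1

    take(hint_chunks, hint_cap)            # filename-hinted chunks (capped)
    take(structure_chunks, structure_max)  # reserved STRUCTURE.md slice
    take(vector_chunks, top_k)             # vector search fills the rest
    return picked
```

Because hints and the structure slice are taken before vector fill, they survive even when `RAG_TOP_K` is small; optional reranking would then reorder this union.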
## 4) Prompting

- Q&A prompts use `QA_SYSTEM_PROMPT` (from utils) with appended instructions to:
  - cite chunks by index and path,
  - give minimal, correct code when needed,
  - rely on `STRUCTURE.md` when present,
  - admit “don’t know” if the answer isn’t in context.
- Edit prompts use a strict edit system prompt (in the file editor script) requiring the full file between markers and forbidding commentary, headers, or code fences.
- Overrides: you can override the base QA system prompt without editing code via `RAG_SYS_PROMPT="...your text..."` or `RAG_SYS_PROMPT_FILE=/abs/path/to/prompt.txt`.
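Override resolution could look like the sketch below; the precedence order in the real `system_prompt(...)` may differ, and the default prompt text here is a placeholder:

```python
import os
from pathlib import Path

# Placeholder text; the real QA_SYSTEM_PROMPT lives in rag_utils.
QA_SYSTEM_PROMPT = "You are a repository assistant. Answer only from the provided context."

def system_prompt(default: str = QA_SYSTEM_PROMPT) -> str:
    """Resolve the QA system prompt: inline env text > env file > built-in default."""
    inline = os.environ.get("RAG_SYS_PROMPT")
    if inline:
        return inline
    path = os.environ.get("RAG_SYS_PROMPT_FILE")
    if path:
        return Path(path).read_text(encoding="utf-8")
    return default
```

This keeps prompt experiments out of version control: point `RAG_SYS_PROMPT_FILE` at a scratch file while iterating.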
5) Token Budgeting
resolve_context_window(model, map, default)picks a conservative context limit for each model family.count_tokens(...)usestiktokenif available (fallback heuristic otherwise).ensure_prompt_fits(...)enforcesinput_tokens + max_output + safety <= context_window.- The file editor computes dynamic
max_outputper request and prints the final value.
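The budget arithmetic can be illustrated as follows; the model names, window sizes, and prefix-matching strategy are assumptions for the sketch, not the actual `rag_utils` tables:

```python
def resolve_context_window(model: str, window_map: dict, default: int) -> int:
    """Pick a context limit by longest matching model-name prefix."""
    for prefix, window in sorted(window_map.items(), key=lambda kv: -len(kv[0])):
        if model.startswith(prefix):
            return window
    return default

def compute_max_output(requested: int, ctx_window: int,
                       input_tokens: int, safety: int) -> int:
    """Dynamic output budget: never exceed what the window can still hold."""
    return min(requested, ctx_window - input_tokens - safety)

windows = {"gpt-4o": 128_000, "llama3": 8_192}  # illustrative values
ctx = resolve_context_window("gpt-4o-mini", windows, default=8_192)
print(compute_max_output(4096, ctx, input_tokens=12_000, safety=1024))
# -> 4096
```

On a small-window model the same call shrinks the output budget instead of failing, which is exactly what the file editor prints before each request.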
## 6) Configuration

Create a `.env` or set env vars in your shell. Sensible code-first defaults are used when possible.

### 6.1 Core Paths

| Variable | Default | Purpose |
|---|---|---|
| `RAG_REPO_ROOT` | project root (parent of `rag/`) | Repository base for relative paths |
| `RAG_INDEX_DIR` | `.rag_index/` under `RAG_REPO_ROOT` | Chroma persistent store |
| `RAG_COLLECTION` | `code_chunks` | Base name for collections |
| `RAG_STRUCTURE_PATH` | `rag/.generated/STRUCTURE.md` | Structure file used as small context slice |
### 6.2 Embeddings & Retrieval

| Variable | Default | Notes |
|---|---|---|
| `RAG_EMBED_BACKEND` | `openai` (Q&A OpenAI) / `sbert` (Ollama) | `openai` or `sbert` |
| `RAG_EMBED_MODEL` | `text-embedding-3-large` or `sentence-transformers/all-MiniLM-L6-v2` | Auto-selected by backend if not set |
| `RAG_TOP_K` | `6` | Total retrieved chunks target |
| `RAG_RERANK` | `0` | Set `1` to enable cross-encoder re-rank |
| `RAG_RERANK_K` | `max(TOP_K, 20)` | Over-fetch size for re-rank |
| `RAG_STRUCTURE_MAX_CHUNKS` | `1` | Reserved `STRUCTURE.md` slices |
| `RAG_FILE_HINT_MAX_CHUNKS` | `8` | Per-file cap for hinted files |
### 6.3 Generation

| Variable | Default | Notes |
|---|---|---|
| `RAG_OPENAI_MODEL` | `gpt-4o-mini` | OpenAI model for Q&A and editing |
| `OPENAI_API_KEY` | (required) | Needed for OpenAI embeddings and/or generation |
| `RAG_OLLAMA_MODEL` | `llama3` | Ollama model name |
| `OLLAMA_URL` | `http://localhost:11434/api/generate` | Ollama endpoint |
| `RAG_SAFETY_MARGIN_TOKENS` | `1024` (OpenAI) / `512` (Ollama) | Safety headroom |
| `OPENAI_MAX_OUTPUT` | `4096` (Q&A default) | File editor computes final value dynamically |
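The code-first defaults could be resolved along these lines; this is an illustrative subset with assumed field names, not the actual `RAGConfig`/`load_config_from_env(...)` definitions:

```python
import os
from dataclasses import dataclass

@dataclass
class RAGConfig:
    repo_root: str
    index_dir: str
    collection: str
    top_k: int

def load_config_from_env() -> RAGConfig:
    """Env vars override code-first defaults (cf. the tables above)."""
    repo_root = os.environ.get("RAG_REPO_ROOT", ".")
    return RAGConfig(
        repo_root=repo_root,
        index_dir=os.environ.get("RAG_INDEX_DIR",
                                 os.path.join(repo_root, ".rag_index")),
        collection=os.environ.get("RAG_COLLECTION", "code_chunks"),
        top_k=int(os.environ.get("RAG_TOP_K", "6")),
    )
```

Centralizing the defaults in one loader is what lets both frontends and the file editor agree on paths and collection names.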
## 7) Usage

### 7.1 Q&A (OpenAI)

```bash
python -m rag.query_openai "How is the registry initialized?" --show-sources

# Optional model override:
python -m rag.query_openai "What writes the DOT graph?" --model gpt-4o

# Adjust retrieval breadth:
python -m rag.query_openai "Where is STRUCTURE.md produced?" --top-k 10
```
### 7.2 Q&A (Ollama)
```bash
python -m rag.query_ollama "Explain emitters pipeline."

# With more context and sources:
python -m rag.query_ollama "How are chunks stored?" --top-k 10 --show-sources
```
### 7.3 Edit a Single File (OpenAI)

```bash
python -m rag.query_openai_file "src/knime2py/implemented_cli.py" \
  --edit "Add meaningful docstrings to all public functions based on the code." \
  --rewrite
```

- The script prints a banner.
- It logs the computed `max_output` before requesting completion.
- It prints only the updated file (and rewrites in place if `--rewrite` is present).
### 7.4 Bulk docstring insertion (example)

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="${1:-src/}"

find "$TARGET_DIR" -type f -name '*.py' ! -name '__*.py' -print0 \
| while IFS= read -r -d '' PYFILE; do
  echo "[RAG] Editing: $PYFILE"
  python -m rag.query_openai_file \
    "$PYFILE" \
    --edit "Add meaningful docstrings to all public functions based on the code." \
    --rewrite
done
```
## 8) Operational Notes

- **`STRUCTURE.md`** must exist and be chunked into the index; otherwise the scripts still work but lose the layout hints. Keep the reserved slice small (e.g., 1 chunk).
- **Index lifecycle.** If retrieval returns nothing, you likely have:
  - a missing or stale `./.rag_index/`,
  - an embed backend/model mismatch (the collection name differs),
  - or the wrong repo root. Rebuild/reindex and ensure envs match.
- **Edit mode exclusions.** The editor deliberately excludes the target file from retrieval to avoid parroting. The source of truth for the target file is the injected “Target file (current contents)” block.
- **Token budgeting.** If you hit context limits:
  - lower `--top-k`, `RAG_STRUCTURE_MAX_CHUNKS`, or `RAG_FILE_HINT_MAX_CHUNKS`,
  - reduce the size of your request text,
  - or switch to a larger-context model.
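The "retrieval returns nothing" checklist above can be automated as a small pre-flight check. This is a sketch that only inspects the filesystem and env vars (env names follow section 6); it does not query Chroma itself:

```python
import os
from pathlib import Path

def diagnose_index(repo_root: str = ".") -> list[str]:
    """Flag the common causes of empty retrieval before running a query."""
    problems = []
    index_dir = Path(os.environ.get("RAG_INDEX_DIR",
                                    str(Path(repo_root) / ".rag_index")))
    if not index_dir.exists():
        problems.append(f"missing index dir: {index_dir} (rebuild the index)")
    backend = os.environ.get("RAG_EMBED_BACKEND", "openai")
    model = os.environ.get("RAG_EMBED_MODEL", "")
    # A backend/model mismatch yields a collection name that was never indexed.
    if backend == "openai" and "sentence-transformers" in model:
        problems.append("mismatch: openai backend with an SBERT model id")
    if backend == "sbert" and model.startswith("text-embedding"):
        problems.append("mismatch: sbert backend with an OpenAI model id")
    return problems
```

Running such a check at script startup turns a silent "No context retrieved" into an actionable message.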
## 9) Troubleshooting

- **“RAG index directory not found … Build the index first.”** The Chroma DB at `./.rag_index/` is missing. Rebuild the index with your indexing script (not included here).
- **“No context retrieved …”** The index is empty, or the collections don’t match your current `RAG_EMBED_BACKEND`/`RAG_EMBED_MODEL`. Align envs and reindex.
- **Marker errors in edit mode.** The model must return the file only between `<<BEGIN_FILE>>` and `<<END_FILE>>`. If violated, the script fails fast. Re-run or tighten the request.
- **OpenAI key missing.** Set `OPENAI_API_KEY` in `.env` or your shell for OpenAI embeddings/generation. For Ollama + SBERT you can avoid OpenAI entirely.
## 10) Security & Safety

- Secrets are not embedded in code. Only `OPENAI_API_KEY` is required for OpenAI paths.
- The `--rewrite` flag writes in place. Use it only on files under version control, and review diffs.
## 11) Known Limitations / Next Steps

- **Indexing pipeline.** Assumed but not documented here (chunk size/overlap, filters). Add a reproducible `rag/index_repo.py` with deterministic chunking and a file manifest writer.
- **Deduping & diversity.** Current retrieval favors hinted files and structure. Consider MMR or diversity-aware selection when `TOP_K` grows.
- **Reranking latency.** Cross-encoder reranking adds latency; enable it only when needed (`RAG_RERANK=1`).
- **Non-Python hints.** Filename hints target `*.py` today. Extend the hint regex if you want JS/TS/MD, etc.
- **Streaming.** Current calls are non-streaming for simplicity.
## 12) Minimal `.env` Example

```bash
# Index + collection
RAG_REPO_ROOT=.
RAG_INDEX_DIR=.rag_index
RAG_COLLECTION=code_chunks
RAG_STRUCTURE_PATH=rag/.generated/STRUCTURE.md

# Retrieval
RAG_EMBED_BACKEND=openai            # or sbert
RAG_EMBED_MODEL=text-embedding-3-large

# Generation
RAG_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...

# Optional
RAG_TOP_K=6
RAG_STRUCTURE_MAX_CHUNKS=1
RAG_FILE_HINT_MAX_CHUNKS=8
RAG_RERANK=0
RAG_RERANK_K=20
RAG_SAFETY_MARGIN_TOKENS=1024
```