RAG for knime2py — Design & Usage

This document describes the Retrieval-Augmented Generation (RAG) layer implemented for the project. It covers the architecture, utilities, scripts, configuration, prompt/retrieval strategy, token budgeting, and concrete usage examples.


1) High-level Overview

  1. Goal. Answer repository questions and safely rewrite single files using an LLM, grounding the LLM on the codebase via a local ChromaDB index.
  2. Backends.
    1. Embeddings: openai (e.g., text-embedding-3-large) or local SBERT (e.g., sentence-transformers/all-MiniLM-L6-v2).
    2. Generation: OpenAI chat models (gpt-4o*, etc.) or Ollama (local models like llama3).
  3. Index. A persistent Chroma index under ./.rag_index/ with:
    1. a chunk collection for content passages, and
    2. a manifest collection mapping basenames (e.g., registry.py) to full repo paths, used for filename-hint retrieval.
  4. Repository structure file. A small rag/.generated/STRUCTURE.md is assumed to exist and is used to give the LLM top-level orientation; a small number of its chunks are always injected (reserved slots).

2) Implemented Components

2.1 rag/rag_utils.py (shared utilities)

  1. Config & Naming
    1. RAGConfig, load_config_from_env(...)
    2. Collection naming: current_collection_name(...), manifest_collection_name(...)
  2. Chroma Access
    1. get_client(...), get_collection(...), get_manifest(...)
  3. Embeddings
    1. encode_query(...) (OpenAI or SBERT)
  4. Retrieval Primitives
    1. retrieve_raw(...), fetch_file_chunks_by_path(...)
    2. Filename hints: extract_file_hints(...), find_paths_by_basenames(...)
    3. Structure slice: structure_chunks(...)
  5. Composite Retrieval
    1. retrieve_with_structure_and_hints(...) — assembly order:
      1. filename-hinted chunks (per-file cap),
      2. STRUCTURE.md chunks (reserved small slice),
      3. vector-search fill,
      4. then an optional cross-encoder rerank across the union (RAG_RERANK=1).
  6. Prompt Utilities
    1. format_chunks(...) (for QA),
    2. build_qa_prompt(system_prompt, question, passages, extra_instructions=""),
    3. QA_SYSTEM_PROMPT and system_prompt(...) (overrides via RAG_SYS_PROMPT or RAG_SYS_PROMPT_FILE)
  7. Token Budgeting
    1. count_tokens(...), ensure_prompt_fits(...), resolve_context_window(...)
  8. Edit-mode Helpers
    1. extract_between_markers(...) for strict <<BEGIN_FILE>> … <<END_FILE>>,
    2. lang_for(...) (for informative context blocks)
  9. Banner
    1. print_mode_banner(...) unified banner for OpenAI and Ollama frontends

Why it matters: this consolidation removes duplication and keeps the front-end scripts thin.
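
To make this concrete, below is a minimal sketch of how a front-end script composes these utilities end to end. The call signatures are assumptions inferred from the names above, not the verbatim rag_utils API.

```python
# Hypothetical composition of the shared utilities; signatures are assumptions.
from rag import rag_utils as U

cfg = U.load_config_from_env()
client = U.get_client(cfg)
chunks_col = U.get_collection(client, U.current_collection_name(cfg))
manifest = U.get_manifest(client, U.manifest_collection_name(cfg))

question = "How is the registry initialized?"
passages = U.retrieve_with_structure_and_hints(question, chunks_col, manifest, cfg)
prompt = U.build_qa_prompt(U.system_prompt(), question, U.format_chunks(passages))
U.ensure_prompt_fits(prompt)  # token guard; the real call takes budget arguments
```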


2.2 rag/query_openai.py (Q&A with OpenAI)

  1. Uses shared utilities for config, retrieval, prompt formatting, token guard, and banner.
  2. Prompt: QA_SYSTEM_PROMPT with instructions to cite chunks and stick to repository context.
  3. CLI:
    1. --top-k to control context breadth,
    2. --show-sources to print retrieved chunk references,
    3. --model to switch models at runtime.
  4. Rerank: optional (RAG_RERANK=1) via a cross-encoder (RAG_RERANK_MODEL), with RAG_RERANK_K as the over-fetch size.

2.3 rag/query_ollama.py (Q&A with Ollama)

  1. Parity with query_openai.py, but generation is sent to Ollama.
  2. Context window and max output are chosen via Ollama model-family heuristics.
  3. Same retrieval strategy (structure slice + hints + vector fill).
  4. Uses the shared banner, prompt formatting, and token guard.

2.4 rag/query_openai_file.py (single-file editor with OpenAI)

  1. Purpose: Rewrite a single file and emit only the fully updated file.
  2. Strict I/O contract.
    1. The model must return the complete file wrapped between markers: <<BEGIN_FILE>> # ...entire updated file... <<END_FILE>>
    2. The script extracts only the payload between the markers (a minimal sketch of this contract appears after this list).
  3. Context construction:
    1. Small STRUCTURE.md slice (reserved slots),
    2. Filename-hinted + vector-retrieved chunks excluding the target file (prevents parroting),
    3. The entire current file is injected as “source of truth.”
  4. Dynamic token budgeting:
    1. Computes computed_max_output = min(requested_max_output, ctx_window - input_tokens - safety),
    2. Prints the computed max tokens before sending the request,
    3. Applies ensure_prompt_fits(...) with the computed value.
  5. Safety & determinism:
    1. temperature=0.0, strict markers, and clear constraints.
  6. --rewrite:
    1. If supplied, writes the updated content back to the original file path (this assumes the file is under version control, which guards against irrevocable loss).
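
The marker contract can be illustrated with a minimal re-implementation. This is illustrative only; the real helper is extract_between_markers(...) in rag_utils and may differ in details:

```python
BEGIN, END = "<<BEGIN_FILE>>", "<<END_FILE>>"

def extract_between_markers(text: str) -> str:
    """Return the payload between the strict file markers; fail fast otherwise."""
    start = text.find(BEGIN)
    end = text.rfind(END)
    if start == -1 or end == -1 or end <= start:
        raise ValueError("reply missing <<BEGIN_FILE>>/<<END_FILE>> markers")
    return text[start + len(BEGIN):end].strip("\n")
```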

3) Retrieval Strategy

  1. Filename hints (priority).
    If the prompt mentions files such as registry.py, those basenames are resolved through the manifest collection into full repo paths. A small, capped number of chunks per hinted file is injected first (see the sketch after this list).
  2. Structure slice (reserved).
    A small number of chunks from rag/.generated/STRUCTURE.md is injected to give the model a global layout view. This keeps the editor/QA from hallucinating directories or misreporting module counts.
  3. Vector search fill.
    The remainder of the context is filled with standard vector search results from the main code chunk collection. Optional re-ranking can be applied across the union (hints + structure + raw).
  4. Exclusions (edit mode).
    When editing, the target file’s chunks are excluded from retrieval so the model cannot simply echo the original content. The canonical source is the explicit “Target file (current contents)” section.
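
Put together, the composite retrieval roughly follows the shape below. The helper names come from rag_utils (Section 2.1), but their exact signatures and the chunk fields used here (e.g. path) are assumptions:

```python
from rag.rag_utils import (
    extract_file_hints, find_paths_by_basenames,
    fetch_file_chunks_by_path, structure_chunks, retrieve_raw,
)

def retrieve(question, chunks_col, manifest, cfg, exclude_path=None):
    ctx = []
    # 1) Filename-hinted chunks first, capped per hinted file.
    for path in find_paths_by_basenames(manifest, extract_file_hints(question)):
        if path != exclude_path:  # edit mode: never retrieve the target file
            ctx += fetch_file_chunks_by_path(chunks_col, path)[:cfg.file_hint_max_chunks]
    # 2) Reserved STRUCTURE.md slice for global layout.
    ctx += structure_chunks(chunks_col, cfg)[:cfg.structure_max_chunks]
    # 3) Vector-search fill up to the overall top-k budget.
    for chunk in retrieve_raw(question, chunks_col, k=cfg.top_k):
        if len(ctx) >= cfg.top_k:
            break
        if chunk not in ctx and getattr(chunk, "path", None) != exclude_path:
            ctx.append(chunk)
    return ctx  # optionally reranked by a cross-encoder when RAG_RERANK=1
```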

4) Prompting

  1. Q&A prompts use QA_SYSTEM_PROMPT (from utils) with appended instructions to:
    1. cite chunks by index and path,
    2. give minimal, correct code when needed,
    3. rely on STRUCTURE.md when present,
    4. admit “don’t know” if the answer isn’t in context.
  2. Edit prompts use a strict edit system prompt (in the file editor script) requiring the full file between markers and forbidding commentary, headers, or code fences.
  3. Overrides: You can override the base QA system prompt without editing code:
    1. RAG_SYS_PROMPT="...your text...", or
    2. RAG_SYS_PROMPT_FILE=/abs/path/to/prompt.txt.
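
The override precedence is plausibly just the following. This is a sketch; the real system_prompt(...) in rag_utils may differ in details:

```python
import os

QA_SYSTEM_PROMPT = "..."  # stand-in for the constant defined in rag_utils

def system_prompt(default: str = QA_SYSTEM_PROMPT) -> str:
    # Precedence: inline env text, then a prompt file, then the built-in default.
    inline = os.getenv("RAG_SYS_PROMPT")
    if inline:
        return inline
    path = os.getenv("RAG_SYS_PROMPT_FILE")
    if path:
        with open(path, "r", encoding="utf-8") as fh:
            return fh.read()
    return default
```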

5) Token Budgeting

  1. resolve_context_window(model, map, default) picks a conservative context limit for each model family.
  2. count_tokens(...) uses tiktoken if available (fallback heuristic otherwise).
  3. ensure_prompt_fits(...) enforces input_tokens + max_output + safety <= context_window.
  4. The file editor computes dynamic max_output per request and prints the final value.
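
Concretely, the editor's budget rule from Section 2.4 reduces to one line of arithmetic (illustrative numbers below):

```python
def computed_max_output(requested: int, ctx_window: int,
                        input_tokens: int, safety: int) -> int:
    # Enforces: input_tokens + max_output + safety <= ctx_window
    return min(requested, ctx_window - input_tokens - safety)

assert computed_max_output(4096, 128_000, 9_000, 1_024) == 4096  # plenty of room
assert computed_max_output(4096, 8_192, 6_000, 1_024) == 1_168   # output shrinks
```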

6) Configuration

Create a .env file or set environment variables in your shell; sensible in-code defaults apply when a variable is unset.

6.1 Core Paths

| Variable | Default | Purpose |
| --- | --- | --- |
| RAG_REPO_ROOT | project root (parent of rag/) | Repository base for relative paths |
| RAG_INDEX_DIR | .rag_index/ under RAG_REPO_ROOT | Chroma persistent store |
| RAG_COLLECTION | code_chunks | Base name for collections |
| RAG_STRUCTURE_PATH | rag/.generated/STRUCTURE.md | Structure file used as a small context slice |

6.2 Embeddings & Retrieval

| Variable | Default | Notes |
| --- | --- | --- |
| RAG_EMBED_BACKEND | openai (OpenAI Q&A) / sbert (Ollama) | openai or sbert |
| RAG_EMBED_MODEL | text-embedding-3-large or sentence-transformers/all-MiniLM-L6-v2 | Auto-selected by backend if not set |
| RAG_TOP_K | 6 | Target total of retrieved chunks |
| RAG_RERANK | 0 | Set to 1 to enable cross-encoder re-ranking |
| RAG_RERANK_K | max(TOP_K, 20) | Over-fetch size for re-ranking |
| RAG_STRUCTURE_MAX_CHUNKS | 1 | Reserved STRUCTURE.md slices |
| RAG_FILE_HINT_MAX_CHUNKS | 8 | Per-file cap for hinted files |
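
Note that the active collection name is derived from the embedding backend and model, which is why a backend/model mismatch makes the index look empty (see Troubleshooting). A hypothetical naming scheme, to illustrate the idea only:

```python
def current_collection_name(base: str, backend: str, model: str) -> str:
    # Hypothetical scheme: base name plus a backend/model fingerprint, so
    # switching RAG_EMBED_BACKEND or RAG_EMBED_MODEL targets a new collection.
    slug = model.replace("/", "_")
    return f"{base}__{backend}__{slug}"

print(current_collection_name("code_chunks", "sbert",
                              "sentence-transformers/all-MiniLM-L6-v2"))
# -> code_chunks__sbert__sentence-transformers_all-MiniLM-L6-v2
```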

6.3 Generation

| Variable | Default | Notes |
| --- | --- | --- |
| RAG_OPENAI_MODEL | gpt-4o-mini | OpenAI model for Q&A and editing |
| OPENAI_API_KEY | (required) | Needed for OpenAI embeddings and/or generation |
| RAG_OLLAMA_MODEL | llama3 | Ollama model name |
| OLLAMA_URL | http://localhost:11434/api/generate | Ollama endpoint |
| RAG_SAFETY_MARGIN_TOKENS | 1024 (OpenAI) / 512 (Ollama) | Safety headroom |
| OPENAI_MAX_OUTPUT | 4096 (Q&A default) | File editor computes the final value dynamically |

7) Usage

7.1 Q&A (OpenAI)

python -m rag.query_openai "How is the registry initialized?" --show-sources
# Optional model override:
python -m rag.query_openai "What writes the DOT graph?" --model gpt-4o
# Adjust retrieval breadth:
python -m rag.query_openai "Where is STRUCTURE.md produced?" --top-k 10
````

7.2 Q&A (Ollama)

```bash
python -m rag.query_ollama "Explain emitters pipeline."
# With more context and sources:
python -m rag.query_ollama "How are chunks stored?" --top-k 10 --show-sources
```

7.3 Edit a Single File (OpenAI)

```bash
python -m rag.query_openai_file "src/knime2py/implemented_cli.py" \
  --edit "Add meaningful docstrings to all public functions based on the code." \
  --rewrite
```

  1. The script prints a banner.
  2. It logs the computed max_output before requesting completion.
  3. It prints only the updated file (and rewrites in place if --rewrite is present).

7.4 Bulk docstring insertion (example)

```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="${1:-src/}"

find "$TARGET_DIR" -type f -name '*.py' ! -name '__*.py' -print0 \
| while IFS= read -r -d '' PYFILE; do
  echo "[RAG] Editing: $PYFILE"
  python -m rag.query_openai_file \
    "$PYFILE" \
    --edit "Add meaningful docstrings to all public functions based on the code." \
    --rewrite
done
```

8) Operational Notes

  1. STRUCTURE.md must exist and be chunked into the index; otherwise the scripts still work but lose the layout hints. Keep the reserved slice small (e.g., 1 chunk).
  2. Index lifecycle. If retrieval returns nothing, you likely have:
    1. a missing or stale ./.rag_index/,
    2. an embed backend/model mismatch (the collection name differs),
    3. or the wrong repo root. Rebuild/reindex and ensure env vars match.
  3. Edit-mode exclusions. The editor deliberately excludes the target file from retrieval to avoid parroting. The source of truth for the target file is the injected “Target file (current contents)” block.
  4. Token budgeting. If you hit context limits:
    1. lower --top-k, RAG_STRUCTURE_MAX_CHUNKS, or RAG_FILE_HINT_MAX_CHUNKS,
    2. reduce the size of your request text,
    3. or switch to a larger-context model.

9) Troubleshooting

  1. “RAG index directory not found … Build the index first.” The Chroma DB at ./.rag_index/ is missing. Rebuild the index with your indexing script (not included here).
  2. “No context retrieved …” Index is empty or collections don’t match your current RAG_EMBED_BACKEND/RAG_EMBED_MODEL. Align envs and reindex.
  3. Marker errors in edit mode. The model must return the file only between <<BEGIN_FILE>> and <<END_FILE>>. If violated, the script fails fast. Re-run or tighten the request.
  4. OpenAI key missing. Set OPENAI_API_KEY in .env or shell for OpenAI embeddings/generation. For Ollama + SBERT you can avoid OpenAI entirely.

10) Security & Safety

  1. Secrets are not embedded in code. Only OPENAI_API_KEY is required for OpenAI paths.
  2. The --rewrite flag writes in place. Use it on files under version control only. Review diffs.

11) Known Limitations / Next Steps

  1. Indexing pipeline is assumed but not documented here (chunk size/overlap, filters). Add a reproducible rag/index_repo.py with deterministic chunking and a file manifest writer.
  2. Deduping & diversity. Current retrieval favors hinted files and structure. Consider MMR or diversity-aware selection when TOP_K grows.
  3. Reranking latency. Cross-encoder reranking adds latency; enable only when needed (RAG_RERANK=1).
  4. Non-Python hints. Filename hints target *.py today. Extend the hint regex if you want JS/TS/MD, etc. (see the sketch after this list).
  5. Streaming. Current calls are non-streaming for simplicity.
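
For example, a hint pattern generalized beyond Python might look like this (a hypothetical regex; the actual pattern inside extract_file_hints(...) is not shown in this document):

```python
import re

# Hypothetical pattern: matches bare basenames such as registry.py,
# app.ts, or STRUCTURE.md mentioned in a prompt.
HINT_RE = re.compile(r"\b[\w.-]+\.(?:py|js|ts|md)\b")

print(HINT_RE.findall("Compare registry.py with STRUCTURE.md and app.ts"))
# -> ['registry.py', 'STRUCTURE.md', 'app.ts']
```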

12) Minimal .env Example

```bash
# Index + collection
RAG_REPO_ROOT=.
RAG_INDEX_DIR=.rag_index
RAG_COLLECTION=code_chunks
RAG_STRUCTURE_PATH=rag/.generated/STRUCTURE.md

# Retrieval
RAG_EMBED_BACKEND=openai        # or sbert
RAG_EMBED_MODEL=text-embedding-3-large

# Generation
RAG_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...

# Optional
RAG_TOP_K=6
RAG_STRUCTURE_MAX_CHUNKS=1
RAG_FILE_HINT_MAX_CHUNKS=8
RAG_RERANK=0
RAG_RERANK_K=20
RAG_SAFETY_MARGIN_TOKENS=1024
```