RAG systems use retrieved documents as context but do not show how each retrieved chunk affects their responses, so users cannot verify faithfulness. Using a RAG system with sources we control, this project instructs the generation LLM to answer only from the retrieved context, which lets us evaluate how the LLM uses that context to produce a response.
all-MiniLM-L6-v2 from sentence-transformers
gpt-4.1-mini
all-MiniLM-L6-v2) to reveal reliance on the source
gpt-4o-mini as the LLM judge
The system was tested using four uploaded documents. If a question falls outside their scope, the model is designed to admit that it cannot answer. This is part of the controlled experiment, which uses a restricted prompting template.
The current documents include:
The system can ingest any .txt or .md file placed in the documents folder.
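As a rough illustration of that ingestion rule, the sketch below reads every .txt and .md file from a folder. The function name load_documents and the default folder name "documents" are assumptions for illustration, not the repository's actual code, and chunking is left out.

```python
from pathlib import Path
from typing import Dict

# Only these extensions are ingested, per the description above.
ALLOWED_SUFFIXES = {".txt", ".md"}


def load_documents(folder: str = "documents") -> Dict[str, str]:
    """Return {file name: file text} for each .txt/.md file in `folder`.

    Hypothetical helper: the real project may walk subfolders, chunk the
    text, or use a different encoding.
    """
    docs: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES:
            docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

Keeping the file name as the key makes it easy to show later which file a retrieved chunk came from.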
The full implementation is available in my GitHub repository: siliconshells.
This section shows which stored text chunks were retrieved as context for the LLM’s answer generation. The system retrieves the top three chunks and displays the file each chunk originated from.
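A minimal sketch of the top-three retrieval step, assuming chunks are already embedded and ranked by cosine similarity against the query embedding. The Chunk tuple layout and the name top_k_chunks are hypothetical, and the ranking measure is an assumption, since the description above does not name it.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical chunk record: (source file, chunk text, embedding vector).
Chunk = Tuple[str, str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float], chunks: List[Chunk], k: int = 3
) -> List[Tuple[float, str, str]]:
    """Return the k best (score, source file, text) triples, best first."""
    scored = [(cosine(query_vec, vec), src, text) for src, text, vec in chunks]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Each returned triple carries the source file name, so the originating file can be displayed next to the chunk.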
Token-Level Saliency measures how much each individual token (word) contributes to the answer's meaning. It uses a leave-one-out approach: each token is removed from the answer one at a time, and the shift in the sentence embedding measures how much that token mattered.
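The leave-one-out idea can be sketched as follows. Two things here are assumptions: the shift is measured as 1 minus cosine similarity (the text above only says "shift in the sentence embedding"), and toy_embed is a hashed bag-of-words stand-in so the example runs without downloading a model. In the project the embed argument would presumably wrap the sentence-embedding model instead.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple


def toy_embed(text: str, dim: int = 256) -> List[float]:
    """Stand-in embedder (NOT the project's model): hashed word counts."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_saliency(
    answer: str, embed: Callable[[str], Sequence[float]] = toy_embed
) -> List[Tuple[str, float]]:
    """Score each whitespace token by how far the embedding moves without it."""
    tokens = answer.split()
    full_vec = embed(answer)
    scores: List[Tuple[str, float]] = []
    for i, tok in enumerate(tokens):
        # Rebuild the answer with token i left out, then re-embed it.
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        shift = 1.0 - cosine(full_vec, embed(reduced))
        scores.append((tok, shift))
    return scores
```

A larger shift means the answer's embedding depends more on that token. This costs one embedding call per token, so batching the reduced sentences would be worthwhile with a real model.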
The first four tabs explain how the answer was constructed. This tab grades how good the answer is, using Ragas's reference-free LLM-judged metrics. After the answer renders, the page asynchronously requests scores from the server (so the answer never waits on them).
Three scores are computed by an LLM judge (gpt-4o-mini):
Scores are cached in Redis for 24 hours per question, so repeating a question costs zero API calls.
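A sketch of that caching behaviour, under stated assumptions: the key prefix, the whitespace and case normalization of the question, and the helper names are all hypothetical. InMemoryTTLStore is a test double that mimics only the get and setex calls of a Redis client so the example runs without a server; a redis-py client could be passed in its place.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # 24 hours, as described above


class InMemoryTTLStore:
    """Stand-in for a Redis client; supports only get() and setex()."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._data: Dict[str, Tuple[float, str]] = {}
        self._clock = clock

    def setex(self, name: str, ttl: int, value: str) -> None:
        # Store the value together with its absolute expiry time.
        self._data[name] = (self._clock() + ttl, value)

    def get(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[name]  # expired: behave as if it was never set
            return None
        return value


def cache_key(question: str) -> str:
    """Hypothetical key scheme: hash of the normalized question text."""
    normalized = " ".join(question.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"ragas:scores:{digest}"


def get_scores(question: str, store, compute: Callable[[str], dict]) -> dict:
    """Return cached scores if present; otherwise compute and cache them."""
    key = cache_key(question)
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: the judge is not called
    scores = compute(question)
    store.setex(key, TTL_SECONDS, json.dumps(scores))
    return scores
```

Here compute stands for whatever function calls the LLM judge. It runs only on a cache miss, which is why a repeated question costs no API calls until the entry expires.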
Reference-based metrics like context_recall are excluded from the live tab because they require curated ground-truth answers.
This tab lists sentences with very low attribution scores. These sentences are likely not sourced from the retrieved documents and may have been hallucinated by the model.
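One way such a filter could work is sketched below. It assumes a sentence's attribution score is its highest cosine similarity to any retrieved chunk, and it uses an illustrative threshold of 0.3. Both are assumptions for the example, as is the name flag_low_attribution; the project's actual scoring and cutoff may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_low_attribution(
    sentences: List[Tuple[str, Sequence[float]]],
    chunk_vecs: List[Sequence[float]],
    threshold: float = 0.3,  # illustrative cutoff, not the project's value
) -> List[Tuple[str, float]]:
    """Return (sentence, score) pairs whose best chunk match is below threshold.

    `sentences` pairs each answer sentence with its precomputed embedding.
    """
    flagged: List[Tuple[str, float]] = []
    for text, vec in sentences:
        # Attribution here = similarity to the closest retrieved chunk.
        score = max((cosine(vec, cv) for cv in chunk_vecs), default=0.0)
        if score < threshold:
            flagged.append((text, score))
    return flagged
```

A low score only indicates weak similarity to the retrieved text, so flagged sentences are candidates for review rather than confirmed hallucinations.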