This tool uses a novel method for evaluating an LLM's writing ability.
Note: this is a work in progress!
A large creative text is used to fill the context window and the model is instructed to continue the text.
A set of standard writing metrics is produced for the generated output and, for comparison, for the original text starting at the point of divergence and spanning the same number of tokens the model generated.
A visualization tool is provided which allows you to compare across generations, models, context windows, test parameters, etc.
For every set of tests, the tool always gives the model the same continuation point in the text. Larger windows are filled by extending backwards from that continuation point.
Example:
For a set of tests composed of 4096, 8192, and 16384 tokens, the process would be:
- chunk 16384 continuous tokens from somewhere in the text
- the 4096 test will use tokens 12289 through 16384
- the 8192 test will use tokens 8193 through 16384
- the 16384 test will use all 16384 tokens
Since the model's tokenizer is used to create these slices, tests with the same maximum context on the same model will have a consistent continuation point and can be directly compared.
This is of course simplified: the tool tries to end at natural breaking points and must leave room for the generation and the instructions, but that is the basic idea.
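As a rough illustration of the backward-filling slice logic, consider the sketch below; the function and variable names are illustrative, not identifiers from this repository.

```python
# Illustrative sketch only, not the tool's actual code.
# `tokens` is the full tokenized text; `end` is the fixed continuation point.
def build_slices(tokens: list[int], end: int, context_sizes: list[int]) -> dict[int, list[int]]:
    slices = {}
    for size in sorted(context_sizes):
        start = max(0, end - size)
        # Every slice ends at the same token, so the model always continues
        # from the same point in the text regardless of window size.
        slices[size] = tokens[start:end]
    return slices

# Example: a 16384-token chunk sliced for 4096/8192/16384-token tests.
# slices = build_slices(tokens, end=16384, context_sizes=[4096, 8192, 16384])
```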
This process allows the evaluator to test individual factors such as different RoPE values, different quantizations, and 16-bit, 8-bit, or 4-bit KV cache.
Example of setting RoPE to optimize at different lengths:
This repository contains two primary analysis tools:
- Context Tester (benchmark_gui.py) - Measures how LLM output quality degrades as context length increases
- Performance Comparison Tool (plot_gui.py) - Creates comparison plots from test results
The following metrics are computed for each generation:
- cloze_score - Text predictability.
- pct_unfamiliar_words - Proportion of words outside the most common 3,000 English words.
- vocabulary_diversity - Type-token ratio measuring lexical variety.
- unique_word_ratio_100 - Ratio of unique words in 100-word chunks.
- long_word_ratio - Proportion of words with 7+ letters.
- avg_word_length - Mean character count per word.
- avg_syllables_per_word - Mean syllable count per word.
- function_word_ratio - Proportion of grammatical words (articles, prepositions, conjunctions) versus content words.
- avg_sentence_length - Mean word count per sentence.
- sentence_length_variance - Variance in sentence lengths.
- sentence_length_skewness - Asymmetry of the sentence-length distribution.
- sentence_length_kurtosis - Prevalence of extreme outliers in sentence length.
- bigram_repetition_rate - Frequency of repeated two-word phrases.
- trigram_repetition_rate - Frequency of repeated three-word phrases.
- word_entropy - Word-level unpredictability using information theory (bits).
- char_entropy - Character-level unpredictability.
- comma_density - Commas per sentence.
- semicolon_density - Semicolons per sentence.
- question_density - Question marks per sentence.
- exclamation_density - Exclamation marks per sentence.
- adjacent_coherence - Semantic similarity between consecutive sentences.
- global_coherence - Overall topical consistency across the entire text.
- local_coherence_3sent - Semantic similarity within three-sentence windows.
- coherence_variance - Variability in local coherence scores.
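As an informal sketch of what a few of these surface metrics capture (not the tool's actual implementation; the function names are illustrative):

```python
import math
from collections import Counter

def vocabulary_diversity(words: list[str]) -> float:
    """Type-token ratio: unique words divided by total words."""
    return len(set(words)) / len(words) if words else 0.0

def bigram_repetition_rate(words: list[str]) -> float:
    """Fraction of two-word phrases that repeat an earlier bigram."""
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(bigrams)

def word_entropy(words: list[str]) -> float:
    """Shannon entropy of the word distribution, in bits."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```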
- Python 3.13 or higher
- A large text to use as the basis for continuation (txt, pdf, or html)
- An OpenAI-compatible API with a chat completion and embedding endpoint
Clone the repository:

git clone https://github.com/jabberjabberjabber/Context-Tester
cd Context-Tester

Install UV and sync dependencies:

pip install uv
uv sync

The tool works with any OpenAI-compatible API endpoint. An embedding endpoint is required for coherence tests.
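A minimal sketch of pointing the standard openai Python client at such an endpoint; the base URL, key, and model names below are placeholders, not values used by this repository:

```python
from openai import OpenAI

# Placeholder endpoint and key; any OpenAI-compatible server should work.
client = OpenAI(base_url="http://localhost:5001/v1", api_key="none")

# Chat completion endpoint, used to generate the continuation.
reply = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Continue the story..."}],
)

# Embedding endpoint, required for the coherence metrics.
emb = client.embeddings.create(model="my-embedding-model", input=["A sentence."])
```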
The system uses a unified tokenizer interface with automatic fallback:
- HuggingFace transformers (primary) - Local tokenization with auto-discovery
- KoboldCpp API (fallback) - Remote tokenization endpoint
- Tiktoken (fallback) - OpenAI tokenization
For gated HuggingFace repositories (like Llama models), set the HF_TOKEN environment variable to automatically authenticate.
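A simplified sketch of that fallback order, assuming the transformers and tiktoken packages are installed (the KoboldCpp remote-tokenization step is omitted here, and the function name is illustrative):

```python
import os

def load_tokenizer(model_name: str):
    """Try HuggingFace tokenization first, then fall back to tiktoken."""
    try:
        from transformers import AutoTokenizer
        # HF_TOKEN lets this succeed for gated repositories such as Llama.
        return AutoTokenizer.from_pretrained(model_name, token=os.environ.get("HF_TOKEN"))
    except Exception:
        import tiktoken
        return tiktoken.get_encoding("cl100k_base")
```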
The text can be txt, pdf, or html. Any formatting works, but better results are obtained if paragraphs are separated by a blank line and the file contains nothing except the story (no introduction, index, or other extra text). If the text is shorter than the maximum context length, it will be wrapped around on itself to fill the window.
uv run benchmark_gui.py

Tests create a directory in results/ with the format:
results/org-model-text-timestamp/
├── metadata.json # Experiment configuration
├── context_2048_results.json # Results for each context size
├── context_4096_results.json
├── ...
├── degradation_analysis.json # Statistical analysis
├── model-text-timestamp.csv # Aggregate data for plotting
├── model-text-timestamp_generations.txt # All LLM outputs
└── model-text-timestamp.png # Performance graphs
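For custom analysis, the aggregate CSV can be loaded directly; a rough sketch using pandas follows (the column names here are assumptions, so check the header of the file your run produces):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path and column names for illustration only.
df = pd.read_csv("results/org-model-text-timestamp/model-text-timestamp.csv")
df.plot(x="context_size", y="vocabulary_diversity", marker="o")
plt.xlabel("Context length (tokens)")
plt.ylabel("Type-token ratio")
plt.show()
```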
uv run plot_gui.py

The following environment variables are supported:
API Keys (checked in order):
- API_KEY
- API_PASSWORD
- OPENAI_API_KEY
- NVIDIA_API_KEY
- NVAPI_KEY
HuggingFace Token:
- HF_TOKEN - For accessing gated models (Llama, etc.)
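The "checked in order" behavior for API keys amounts to something like this sketch:

```python
import os

# Return the first API key found, checked in the order listed above.
API_KEY_VARS = ["API_KEY", "API_PASSWORD", "OPENAI_API_KEY", "NVIDIA_API_KEY", "NVAPI_KEY"]
api_key = next((os.environ[v] for v in API_KEY_VARS if os.environ.get(v)), None)
```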


