World, meet Flint α.
Flint α, meet world.

Flint. A model built for inspiration.

A message from Springboards

While industrial AI races toward reasoning and accuracy, we realised that for creative industries, the 'correct' answer is often the least interesting one. So today we're introducing Flint: a model designed specifically to inspire, not to give you the answers.

We started Springboards with the mission to spark better ideas in people. But after building for three years, one thing has become impossible to ignore: frontier models are getting smarter, faster, and more polished, while their outputs are getting eerily similar and more repetitive.

For a lawyer or an accountant, convergence can be a feature. For a strategist, writer, marketer, comedian or creative team, it is a bug.

So we built the model we needed ourselves. 

Flint is a small model with big implications. As the first language model designed specifically around inspiration, it achieves a dramatic increase in output diversity on creative tasks WITHOUT degrading performance in other areas. 

In other words, it’s a model with entropy in the right places.

We are releasing an alpha version of Flint today via the Springboards app.

A Divergence Model?

Divergent and convergent thinking are fundamental elements of the creative process.

Divergent thinking is the act of going wide and exploring possibilities, while convergent thinking narrows those options down to a single solution. Both are important, but frontier LLMs are particularly poor tools for divergent thinking because of their limited output diversity. By design or by accident, nearly every LLM in the world converges on the same small set of answers, even for open-ended questions, a phenomenon known as "mode collapse".

As a result, if used as a tool for brainstorming or ideation, LLMs are likely to lead us all to the same place, and make the world a lot less interesting.

With successive releases, convergence amongst the frontier models has only gotten worse. AI companies are optimising for accuracy across domains like science, mathematics and coding. Hallucination is treated as failure. But there is a whole class of creative and open-ended tasks for which divergence is much more important than accuracy. Flint is built for these tasks, so we have dubbed it a divergence model.

Concretely, Flint is trained to have higher entropy at the key moments in a generation that lead to substantively different answers. Instead of consistently reinforcing the highest-probability path, Flint is trained to produce a higher-entropy probability distribution wherever multiple valid generation paths exist. This lets less obvious ideas and answers emerge.

The result is structured variation. Less repetition, less slop and more range.
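To picture what "entropy in the right places" means, here is a toy sketch in plain Python. It has nothing to do with Flint's actual training; it just shows how flattening a peaked next-token distribution (here via a simple temperature rescale) raises its entropy, so more continuations stay live at a branch point:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def temper(probs, temperature):
    """Rescale a distribution by a sampling temperature.
    temperature > 1 flattens it (higher entropy); < 1 sharpens it."""
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# A peaked next-token distribution: one "obvious" continuation dominates,
# so repeated sampling almost always takes the same path.
peaked = [0.85, 0.05, 0.04, 0.03, 0.03]
flattened = temper(peaked, 2.0)

print(round(entropy(peaked), 3))     # low entropy: mode-collapse territory
print(round(entropy(flattened), 3))  # higher entropy: more paths stay live
```

Plain temperature raises entropy everywhere, including in places where you want accuracy; the claim above is that Flint's training raises it selectively, at the moments that actually change where a generation goes.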

Flint leads SOTA models on NoveltyBench by a wide margin

NoveltyBench is a benchmark that measures how many meaningfully distinct responses a model generates across ten samples of the same open-ended prompt.
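The shape of that metric is easy to sketch. NoveltyBench itself uses a learned notion of when two responses count as equivalent; the toy version below substitutes a crude word-overlap threshold (the function names and threshold are ours, purely illustrative) and greedily clusters the samples, so the cluster count stands in for "meaningfully distinct responses":

```python
def jaccard(a, b):
    """Word-overlap similarity of two responses: 0 = disjoint, 1 = same word set."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def count_distinct(responses, threshold=0.5):
    """Greedy clustering: a response joins the first cluster whose
    representative it resembles, otherwise it starts a new cluster.
    The final cluster count approximates 'meaningfully distinct' answers."""
    representatives = []
    for r in responses:
        if not any(jaccard(r, rep) >= threshold for rep in representatives):
            representatives.append(r)
    return len(representatives)

samples = [
    "A lighthouse keeper who collects bottled storms",
    "A lighthouse keeper collecting bottled storms",
    "An accountant for a circus of retired ghosts",
    "A cartographer mapping cities that only exist at night",
]
print(count_distinct(samples))  # the two lighthouse variants collapse into one
```

A mode-collapsed model scores low here no matter how polished each individual sample is, which is exactly what the benchmark is designed to expose.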

On NoveltyBench, Flint α scores 7.47 mean distinct responses out of 10.

In comparison, the SOTA models perform significantly worse.

Gemini 3.1 Pro scores 3.19.

GPT-5.4 scores 2.54.

Claude 4.6 Sonnet scores 1.83.

Flint also more than doubles the NoveltyBench score of its base model, Qwen3-30B-A3B, which scores 3.11. This shows Flint is not just a lightly remixed Qwen. It behaves like a different creative instrument.

Chart: NoveltyBench mean distinct responses. Average number of meaningfully distinct responses out of 10 samples; higher = more diverse. Source: novelty-bench.github.io

Generations Explorer: browse 10 generations per prompt per model to see how diversity manifests in actual outputs. Explore the difference for yourself!

Flint is less repetitive, even when asked the same thing 50 times

Another test we ran looked at how similar a model’s responses are when given the exact same prompt repeatedly.

We used prompts from the NeurIPS 2025 Artificial Hivemind paper and measured the similarity of outputs using cosine similarity — a standard metric that compares how closely related two responses are.

The scale runs from 0 to 1: 0 means completely different, 1 means effectively identical.
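The pipeline behind these numbers embeds each response with a sentence-embedding model and then averages pairwise similarities; the sketch below shows only that final computation, on hypothetical 2-D vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_intra_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of response embeddings."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

# Toy embeddings: near-duplicate responses vs. genuinely spread-out ones.
tight = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15]]
spread = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(round(mean_intra_similarity(tight), 3))   # close to 1: repetitive
print(round(mean_intra_similarity(spread), 3))  # much lower: diverse
```

With 50 samples per prompt the average runs over 1,225 pairs, but the idea is identical: the tighter a model's responses cluster in embedding space, the closer this score sits to 1.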

When we sample 50 responses to the same query, most models stay within a tight, repetitive band.

Flint does not.

Its mean intra-model similarity is 0.721. Lower is better.

For comparison:
GPT-5.4: 0.864
Gemini 3.1 Pro: 0.871
Claude 4.6 Sonnet: 0.905

Same prompt. Far more range.

Chart: mean intra-model similarity, averaged across all queries; lower = more diverse. Same prompt, 50 times. Flint: 0.721. Everyone else: 0.864–0.905.

The per-query distribution view makes the difference even clearer.

Other models cluster tightly in the high-similarity zone, mostly around 0.8 to 1.0. Flint spreads much wider, with some prompts dropping as low as 0.12 similarity across samples.

On prompts like writing a confession from a mathematician, inventing a new emotion, generating a manifesto from an unusual perspective, or imagining what gravity would feel like in reverse, Flint keeps exploring. Other models tend to return the same answer in slightly different clothes.

This is the heart of the model. Not synthetic randomness. Real divergence on open-ended creative tasks.

Chart: per-query similarity distribution. Each point is one query's mean similarity score across 50 responses. Other models cluster. Flint ranges.

Flint is not just different from itself. It is different from everyone else.

Across 100 queries, the overall mean inter-model similarity is 0.740.

Flint’s average similarity to other models is just 0.672, making it the most distinctive model in the set.

The most similar pair in the comparison is Claude 4.6 Sonnet and Gemini 3.1 Pro at 0.759.

Chart: pairwise inter-model similarity. Mean cosine similarity between model pairs across 100 queries; Flint's 0.672 average is the lowest in the comparison.

The bit we are most excited about: we did not break the base model to get here

This is where Flint gets really interesting.

On MMLU-STEM, Flint scores 78.9% overall, matching Qwen3-30B-A3B's 78.9%.

What matters is the finding underneath it: divergence tuning does not have to be a tax on capability. You can train a model to range more widely without gutting what it already knows.

Chart: per-subtask accuracy, grouped bars sorted by Flint α accuracy. Divergence without collapse: Flint scores 81.5% vs 82.0% for Qwen3-30B-A3B.

Responsible AI

Flint's divergence training preserves its responsible-AI performance. On TruthfulQA MC1, Flint scores 34.4% versus 34.0% for Qwen3-30B-A3B, and on ToxiGen standard accuracy Flint leads with 59.6% compared to 58.1%.

What Flint proves

Frontier models are converging. That makes them powerful across reasoning, coding, planning, and knowledge work. But it also makes them predictable. The same patterns reappear. The same shapes repeat. Over time, consistency starts to flatten range.

Flint shows there is another path. We have dramatically increased output diversity without significantly degrading performance on other benchmarks. A small model can expand the space of possible responses, reduce repetition, and stay coherent enough to be useful. That is the breakthrough.

It also points to a better creative workflow. Flint is not a replacement for frontier models. It is a multiplier. Flint generates range. Larger models provide depth, knowledge, and reasoning. Humans apply taste and judgment. Instead of one polished answer, you get multiple starting points worth pursuing.

Flint is available now.
Only in the Springboards app. Give it a go and let us know what you think.