In the era of generative AI, visibility is about being named, trusted, and cited by the models people use. For brands, this means the question shifts from “Do we appear in Google Search?” to “Does an AI model mention us, cite us, and link to the right page when a user asks a question?”

In this post we unpack how large language models (LLMs) and AI assistants decide which brands to cite, and why. We’ll examine three core components: signal ingestion, source and citation selection, and brand presence mechanics. We’ll also highlight research findings and actionable implications for brand owners.

How models ingest and surface sources

Training + retrieval + output

LLMs such as the one powering ChatGPT operate in two broad phases: (1) training on large-scale data and (2) retrieval or browsing (depending on mode) to provide responses. For example, according to OpenAI:

“We develop our foundation models using (1) publicly available information on the internet, (2) third-party data partnerships, and (3) user-/human-trainer-provided data.” (OpenAI Help Center)

This means models have broad pattern knowledge of how brands, domains, and web entities interact. When they generate an answer, if a browsing or retrieval step is enabled (or part of the pipeline), they can surface live citations — specific URLs or domains.

Models don’t operate exactly like search engines (they don’t simply rank via backlinks and PageRank). Instead, they surface content that fits the prompt context, is accessible, and carries credible metadata. One implication: citation ≠ ranking, but it is a strong signal of “trusted visibility”.

Key research takeaways on how sources are selected

  • A recent analysis of 6.8 million citations across ChatGPT, Gemini, and Perplexity found that 86% of citations came from brand-managed sources (websites and listings the brand controls). (Yext)
  • Another experiment comparing ChatGPT with Google found that ChatGPT favors branded content more heavily: self-owned sites (+3.0 pts) and competitor domains (+11.1 pts) relative to Google Search. (Zenith)
  • Academic work shows that LLMs tend to reflect human citation patterns, but with an even stronger “rich get richer” bias (the so-called Matthew Effect). (arXiv)

Together, these findings reveal that while citation selection is complex, the broad contours favor brands that have solid web presence, structured ownership of content, and correct metadata/links.

What drives brand citations

Here are the key input signals that models use when choosing which brands to mention and cite:

a) Coverage & fit to the query

If a brand’s page directly answers the user’s question (for example: “Which CRM should a startup use?”) the model is more likely to cite that page. The stronger the alignment of the page’s content with the query intent, the higher the chance of being included.

b) Prominence & positioning

Models appear to give more weight to brand mentions that occur early in an answer, and to citations drawn from well-positioned pages. One recent study noted:

“Brands mentioned in the first two sentences of an AI response receive ~5× more consideration than brands mentioned later.” (Evertune)

Thus, being the brand framed first in the answer, or owning the page the model selects, boosts the likelihood of being cited.

c) Evidence & trustworthiness

The model implicitly asks, “Is this source credible?” Indicators include valid schema markup, a recent update date, multiple citations converging on one brand domain, and balanced sourcing. A weak or misaligned source may be passed over.

For example, the “How well do LLMs cite relevant medical references?” study found that roughly 50–90% of model responses contained unsupported citations, underscoring that source quality matters. (arXiv)

d) Risk guard (tone, misattribution, hallucination)

Brands that appear in negative-tone contexts, or in answers the model flags for potential hallucination or misattribution, earn less trust from the model. The goal is not just appearing, but appearing correctly, consistently, and in the right context.

Why this matters for brands

When your brand is mentioned but not cited, or cited via the wrong page (e.g., your homepage when a deeper sub-page would answer the query), you effectively lose authority in the AI channel. This creates a gap: you’re visible, but not trusted.

Because models are increasingly a first point of discovery, being cited via the correct page becomes a competitive advantage. Brands that understand these signal mechanics will be ahead of those treating AI as “just another channel”.

Implications & actions for brands

Here are strategic actions based on how AI models decide on citations:

  1. Ensure “best-fit” pages exist for your key questions. If you don’t have a dedicated page that answers the query, you’ll lose alignment.
  2. Put schema and structured data in place. Define the entity and make sure your page carries correct markup (mainEntityOfPage, about, description, dateModified) so retrieval-augmented models can trust it.
  3. Optimize mention-to-citation flow. Getting your brand mentioned early in the answer helps. That means your page needs to clearly state the brand and topic at the top.
  4. Control your domain presence. Since ~86% of citations are from brand-owned sources, you should ensure you’ve cleaned up domain aliases, canonical tags, and that you own the authoritative content.
  5. Monitor citation patterns across models and locales. Each engine has different behaviours; your brand may need to show up in ChatGPT vs Gemini vs Perplexity via different content emphases.
  6. Reduce “misses” and fix wrong-page citations. If the model cites your homepage instead of your product page, you lose alignment. Fix internal linking, content hierarchy and submit that sub-page to indexing or retrieval pipelines.
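To make action 2 concrete, here is a minimal sketch of what such markup might look like on an answer page. All URLs, the brand name, and the dates below are hypothetical placeholders, not a prescribed implementation; mainEntityOfPage, about, description, and dateModified are the schema.org property names.

```html
<!-- Hypothetical JSON-LD for a "best-fit" answer page.
     example.com, ExampleCRM, and all dates are placeholders. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "mainEntityOfPage": "https://www.example.com/guides/crm-for-startups",
  "headline": "Which CRM should a startup use?",
  "description": "How ExampleCRM fits early-stage startup teams.",
  "about": {
    "@type": "Organization",
    "name": "ExampleCRM",
    "url": "https://www.example.com"
  },
  "datePublished": "2025-01-15",
  "dateModified": "2025-06-01"
}
</script>
<!-- Canonical tag consolidating domain aliases onto one
     authoritative URL (action 4). -->
<link rel="canonical" href="https://www.example.com/guides/crm-for-startups">
```

Validating markup like this with the Schema.org validator or Google’s Rich Results Test before relying on it is a sensible precaution.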

What we still don’t fully know

  • Models’ exact weighting of signals remains proprietary (OpenAI, Google do not publish full weighting details).
  • The mechanism of real-time page retrieval vs internal training memory differs across products.
  • How temporal signals (freshness, update date) quantitatively affect citations is still under-researched.
  • How “entity trust” (brand reputation, mentions across the web) translates into model citation probability has not been quantified.

Final thought

In the world of AI-driven discovery, your brand needs to be more than visible: it needs to be trustable, cited, and aligned with the query. Smart brands will treat AI assistants not as search boxes but as recommendation engines they can influence through content, metadata, and structural signals. By understanding how models decide which brands to cite, you put yourself in position to be picked, trusted, and linked to when it matters.

