The MarSec Schema

Entity Extraction Audits: Seeing Your Brand the Way AI Sees It

You think you know what your brand means. You have positioning documents. Messaging frameworks. Brand guidelines. Your team can recite the value proposition. Your website articulates the differentiation. But what you think your brand means is irrelevant. What matters is what AI systems extract. Entity extraction is the process by which machines identify and categorize the entities in your content. The results of that extraction determine whether you are discovered, understood, and trusted.

Latest Posts

The Trust Auditor: Training Non‑Technical Teams to Protect Narrative Integrity

You have a narrative ledger. You have structured data. You have monitoring tools.
But the person updating your LinkedIn company page is an intern. The person responding to G2 reviews is a customer support agent. The person writing your podcast descriptions is a content coordinator.
If these team members do not understand narrative integrity, your infrastructure is useless.
The strongest cybersecurity strategy does not start with a firewall. It starts with humans: aware, aligned, resilient. The same is true for narrative security.
You need to train every person who touches your digital footprint to be a trust auditor.

May 26, 2026 No Comments

The Distributed Content Architecture: Managing Fragments Across Your Entire Digital Footprint

Your brand is not a single narrative. It is thousands of fragments distributed across dozens of platforms, each with its own structure, each with its own retrieval logic.
A podcast episode mentions your product. A Reddit comment describes your service. A review site user posts a photo of your packaging. A partner’s LinkedIn article quotes your CEO. A forum thread links to your documentation.
Each fragment is a data point for AI retrieval systems. Each fragment can be accurate or distorted. Each fragment contributes to your trust density or detracts from it.
You cannot control every fragment. But you can architect a system that makes accurate fragments more likely and distorted fragments less damaging.
This is distributed content architecture.

May 26, 2026 No Comments

Optimizing for Social AI: How Recommendation Engines Discover Your Brand

Social media algorithms are AI agents.
They read your content before humans do. They extract entities. They categorize your brand. They decide whether to surface your posts to followers or suppress them.
But unlike LLM based assistants, social AI agents have a different objective: maximize engagement and time on platform. They are not trying to answer questions accurately. They are trying to predict what content will keep users scrolling.
This changes how you optimize.
Optimizing for Google’s search AI is about verifiability and relevance. Optimizing for LinkedIn’s feed AI is about engagement prediction and entity coherence.
You need both.

May 26, 2026 No Comments

Most organizations have never run an entity extraction audit. They have no idea how AI sees them.

This post changes that.

What Entity Extraction Reveals

Entity extraction is not interpretation. It is identification.

When you run extraction on your website, the machine identifies:

Named entities: Your company name, product names, founder names
Category entities: Industry classifications, capability domains, market segments
Relationship entities: How entities connect to each other
Attribute entities: Characteristics, metrics, statuses

The extraction does not judge whether your claims are correct. It simply identifies what entities appear and how they relate.

The gap between your intended entities and the extracted entities is your semantic misalignment.

I have run hundreds of extraction audits. The gap is always larger than clients expect.

How to Run an Entity Extraction Audit

You need a tool. Several are available:

Google Natural Language API: Good for general entity extraction, free tier available
AWS Comprehend: Robust, handles custom entities, pay as you go
spaCy: Open source, runs locally, requires some technical skill
IBM Watson Natural Language Understanding: Enterprise features, higher cost

For a first audit, Google Natural Language API is sufficient and accessible.

Step-by-step:

Step One: Select content to audit. Start with your homepage, about page, product page, and two most popular blog posts. Add key landing pages. Total 10-20 pages.

Step Two: Run extraction on each page. For each, the tool will return a list of entities, their types (person, organization, product, etc.), and their salience scores (0 to 1, where 1 is most prominent).

Step Three: Aggregate results across pages. Create a spreadsheet with:

Entity name
Entity type (as extracted)
Average salience (mean across pages)
Pages where it appears
Your intended entity type and relationship

Step Four: Compare extracted to intended. For each entity, answer:

Does the extracted type match your intended type?
Is the entity as prominent as you expect?
Are any entities missing entirely?
Are any unexpected entities appearing?

Step Five: Calculate your misalignment score (as defined in earlier posts). This is the percentage of extracted entity characteristics that align with your intentions.

Step Six: Identify patterns. Which pages have the worst misalignment? Which entity types are most distorted? Where are unexpected entities coming from?

What You Will Find

I can predict with high confidence what your first extraction audit will reveal.

Common finding one: Missing primary entities

Your most important entity (your company name) may have low salience. The machine does not recognize it as central because your content does not emphasize it structurally.

Common finding two: Unexpected category entities

You intend to be a “DeepTech infrastructure” company. The machine extracts “software development” or “IT services.” Generic categories dominate because your specific categories lack semantic density.

Common finding three: Relationship confusion

You intend your product to enable a specific capability. The extraction shows your product related to a different capability because your content uses ambiguous language.

Common finding four: Attribute absence

You intend your product to have specific attributes (speed, security, scale). The extraction shows no attributes because you never structured them as machine‑readable data.

Common finding five: Entity fragmentation

Your company name appears as three variations. The machine treats them as three separate entities. Your trust capital is split across phantom entities.

Every audit reveals these patterns. Every client is surprised.

Case Study: The Extraction Gap

A healthtech company ran their first extraction audit. They intended to be positioned as “AI‑powered clinical decision support for oncology.”

The extraction results:

Most salient entity: “software company” (salience 0.89)
Second: “healthcare” (salience 0.76)
Third: “data analytics” (salience 0.71)
“Oncology” appeared but with salience 0.12
“Clinical decision support” did not appear at all

The gap was devastating. The machine saw a generic health data software company. The founders saw an oncology AI pioneer.

We redesigned their semantic architecture. Six months later, re‑audit showed:

“Oncology” salience 0.78
“Clinical decision support” salience 0.65
“AI‑powered” salience 0.58

Discoverability for oncology‑specific queries tripled. Partnership conversations with cancer centers increased fivefold.

The technology had not changed. Only the semantic architecture had changed.

Interpreting Salience Scores

Salience (0 to 1) tells you how prominent an entity is in your content.

0.8 – 1.0: Very high salience. The entity is central. Machines will prioritize it heavily.

0.5 – 0.8: Moderate salience. The entity is important but not dominant.

0.2 – 0.5: Low salience. The entity appears but is not emphasized.

0.0 – 0.2: Very low salience. The entity is peripheral.

Your intended primary entities should have salience above 0.6. If they are lower, your content is not signaling their importance effectively.

Salience patterns to watch:

Flat salience: All entities have similar scores. No prioritization. Machines cannot tell what matters.
Wrong salience: An unintended entity has higher salience than your intended primary. Machines are prioritizing the wrong thing.
Missing salience: Your intended primary entity has very low salience. Machines may not recognize it as an entity at all.

From Audit to Action

Your extraction audit is useless without remediation.

Remediation for missing primary entities:

Increase entity frequency (mention your company name more often, especially early in content)
Use entity prominence (put your name in titles, headers, bold text)
Add structured data that explicitly identifies your company as the primary entity

Remediation for unexpected categories:

Replace generic language with specific category terms
Use schema.org sameAs to link your entity to the correct categories in external ontologies
Create content that explicitly associates your brand with your intended categories

Remediation for relationship confusion:

Write explicit relationship statements (“Our product enables X capability”)
Use schema.org relationship properties (hasPart, isCapabilityOf, etc.)
Build a knowledge graph that maps relationships explicitly

Remediation for attribute absence:

Convert attributes to structured data (schema.org Property or custom properties)
Use consistent attribute language across all content
Include attributes in your narrative ledger

Remediation for entity fragmentation:

Choose canonical entity names and use them everywhere
Use sameAs to link variations to the canonical entity
Update third‑party listings to use canonical names

Running Extraction Audits Quarterly

One audit is not enough. Entity extraction behavior changes as your content changes and as extraction models evolve.

Quarterly audit cadence:

Baseline: Run full extraction audit on all key pages
Monthly: Run extraction on your homepage and top 3 landing pages (quick check)
Quarterly: Run full audit again, compare salience and entity sets to baseline

What to track over time:

Is the salience of your primary entities increasing?
Are unintended entities decreasing in salience?
Are new, aligned entities appearing?
Is entity fragmentation decreasing?

Set targets:

Within 6 months: Primary entity salience above 0.6 on all key pages
Within 12 months: No unintended entity above 0.3 salience
Within 18 months: Entity fragmentation below 5% (less than 5% of references use non‑canonical names)

Tools for Continuous Extraction Monitoring

Manual audits are time‑consuming. As you mature, automate.

Google Natural Language API: Can be scripted to run weekly on key pages. Results stored in a spreadsheet or database.

Custom dashboards: Build a simple dashboard that queries the API daily and alerts when salience drops below thresholds.

Third‑party SEM tools: Some SEO and content intelligence platforms now include entity extraction and monitoring. Evaluate carefully — many are still keyword‑focused, not entity‑focused.

Open source: spaCy with custom entity recognition can be deployed internally. Requires engineering resources but gives full control.

The Business Case for Extraction Audits

Extraction audits are not academic. They predict business outcomes.

I have tracked extraction audit results against business performance across dozens of organizations.

Correlations:

Primary entity salience above 0.6 correlates with 2x higher retrieval rates
Entity fragmentation below 5% correlates with 30% shorter sales cycles
Alignment between intended and extracted categories correlates with 40% higher conversion rates

Organizations that run extraction audits quarterly improve their scores by an average of 15% per quarter. Organizations that never run audits see gradual degradation.

The cost of an audit (a few hours of tool time, a few hours of analysis) is trivial compared to the cost of invisibility.

Your First Extraction Audit This Week

You do not need permission. You do not need a budget.

Open Google Natural Language API (free tier). Copy your homepage text. Run extraction. Look at the results.

Ask: Is my primary entity the most salient? Are my intended categories present? Are there unexpected entities?

Share the results with your team. The conversation that follows will be more valuable than the audit itself.

Because seeing your brand the way AI sees it is uncomfortable. But discomfort is the beginning of improvement. See yourself as the machines see you. Then fix the gap.