The MarSec Schema

Entity Extraction Audits: Seeing Your Brand the Way AI Sees It

You think you know what your brand means. You have positioning documents. Messaging frameworks. Brand guidelines. Your team can recite the value proposition. Your website articulates the differentiation. But what you think your brand means is irrelevant. What matters is what AI systems extract. Entity extraction is the process by which machines identify and categorize the entities in your content. The results of that extraction determine whether you are discovered, understood, and trusted.

Most organizations have never run an entity extraction audit. They have no idea how AI sees them.

This post changes that.


What Entity Extraction Reveals

Entity extraction is not interpretation. It is identification.

When you run extraction on your website, the machine identifies:

  • Named entities: Your company name, product names, founder names
  • Category entities: Industry classifications, capability domains, market segments
  • Relationship entities: How entities connect to each other
  • Attribute entities: Characteristics, metrics, statuses

The extraction does not judge whether your claims are correct. It simply identifies what entities appear and how they relate.

The gap between your intended entities and the extracted entities is your semantic misalignment.

I have run hundreds of extraction audits. The gap is always larger than clients expect.


How to Run an Entity Extraction Audit

You need a tool. Several are available:

  • Google Natural Language API: Good for general entity extraction, free tier available
  • Amazon Comprehend (AWS): Robust, handles custom entities, pay as you go
  • spaCy: Open source, runs locally, requires some technical skill
  • IBM Watson Natural Language Understanding: Enterprise features, higher cost

For a first audit, Google Natural Language API is sufficient and accessible.

Step-by-step:

Step One: Select content to audit. Start with your homepage, about page, main product page, and your two most popular blog posts. Then add key landing pages until you have 10 to 20 pages in total.

Step Two: Run extraction on each page. For each, the tool will return a list of entities, their types (person, organization, product, etc.), and their salience scores (0 to 1, where 1 is most prominent).
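Here is a minimal sketch of what Step Two looks like in practice. It assumes the v1 `documents:analyzeEntities` REST method of the Cloud Natural Language API and a JSON response with an `entities` array whose items carry `name`, `type`, and `salience`. The function names and the sample response are hypothetical, and nothing is sent over the network.

```python
import json

def build_request(page_text: str) -> dict:
    """Body for POST https://language.googleapis.com/v1/documents:analyzeEntities."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": page_text},
        "encodingType": "UTF8",
    }

def parse_entities(response: dict) -> list[dict]:
    """Flatten a response into rows, sorted by salience (highest first)."""
    rows = [
        {
            "name": e["name"],
            "type": e.get("type", "UNKNOWN"),
            "salience": float(e.get("salience", 0.0)),
        }
        for e in response.get("entities", [])
    ]
    return sorted(rows, key=lambda r: r["salience"], reverse=True)

# Hypothetical response, invented for illustration only.
sample_response = {
    "entities": [
        {"name": "analytics", "type": "OTHER", "salience": 0.21},
        {"name": "Acme Health", "type": "ORGANIZATION", "salience": 0.64},
    ],
    "language": "en",
}

rows = parse_entities(sample_response)
print(json.dumps(rows, indent=2))
```

Keeping the parsing separate from the API call means the same rows can be produced from saved responses, which helps when you re-run an audit later.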

Step Three: Aggregate results across pages. Create a spreadsheet with:

  • Entity name
  • Entity type (as extracted)
  • Average salience (mean across pages)
  • Pages where it appears
  • Your intended entity type and relationship
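The spreadsheet in Step Three can be assembled with a few lines of code. This sketch assumes each page's results are already a list of rows with `name`, `type`, and `salience` keys (a hypothetical shape), and that "average salience" means the mean over the pages where the entity appears. If you prefer to count absent pages as zero, divide by the total page count instead.

```python
from collections import defaultdict
from statistics import mean

def aggregate(pages: dict[str, list[dict]]) -> list[dict]:
    """Combine per-page entity rows into one row per (name, type) pair."""
    salience = defaultdict(list)
    seen_on = defaultdict(list)
    for url, rows in pages.items():
        for row in rows:
            key = (row["name"], row["type"])
            salience[key].append(row["salience"])
            seen_on[key].append(url)
    table = [
        {
            "entity": name,
            "extracted_type": etype,
            "avg_salience": round(mean(scores), 3),
            "pages": seen_on[(name, etype)],
            "intended": "",  # filled in by hand during Step Four
        }
        for (name, etype), scores in salience.items()
    ]
    return sorted(table, key=lambda r: r["avg_salience"], reverse=True)

# Hypothetical per-page results.
pages = {
    "/": [
        {"name": "Acme Health", "type": "ORGANIZATION", "salience": 0.6},
        {"name": "software", "type": "OTHER", "salience": 0.3},
    ],
    "/about": [
        {"name": "Acme Health", "type": "ORGANIZATION", "salience": 0.4},
    ],
}

table = aggregate(pages)
for row in table:
    print(row["entity"], row["extracted_type"], row["avg_salience"], row["pages"])
```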

Step Four: Compare extracted to intended. For each entity, answer:

  • Does the extracted type match your intended type?
  • Is the entity as prominent as you expect?
  • Are any entities missing entirely?
  • Are any unexpected entities appearing?
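Three of the four questions in Step Four reduce to set comparisons. The sketch below assumes you keep two hypothetical mappings from entity name to type: one for what you intended and one for what was extracted. The prominence question is left out because it needs your own expected salience per entity.

```python
def compare(extracted: dict[str, str], intended: dict[str, str]) -> dict[str, list[str]]:
    """Bucket the differences between extracted and intended entities."""
    shared = extracted.keys() & intended.keys()
    return {
        "type_mismatch": sorted(n for n in shared if extracted[n] != intended[n]),
        "missing": sorted(intended.keys() - extracted.keys()),
        "unexpected": sorted(extracted.keys() - intended.keys()),
    }

# Hypothetical names and types, for illustration only.
intended = {
    "Acme Health": "ORGANIZATION",
    "OncoAssist": "CONSUMER_GOOD",
    "oncology": "OTHER",
}
extracted = {
    "Acme Health": "ORGANIZATION",
    "OncoAssist": "ORGANIZATION",
    "IT services": "OTHER",
}

report = compare(extracted, intended)
print(report)
```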

Step Five: Calculate your misalignment score (as defined in earlier posts). It is built on the percentage of extracted entity characteristics that align with your intentions: the lower that percentage, the larger your misalignment.
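The scoring definition from the earlier posts is not reproduced here, so treat this as a sketch under one assumption: every Step Four question is recorded as a pass/fail check per entity, and the score is a simple ratio. It reports both the aligned share and its complement. The function name and sample checks are hypothetical.

```python
def alignment_scores(checks: list[bool]) -> tuple[float, float]:
    """Return (aligned %, misaligned %) for a list of pass/fail checks."""
    if not checks:
        raise ValueError("no checks recorded")
    aligned = 100.0 * sum(checks) / len(checks)
    return round(aligned, 1), round(100.0 - aligned, 1)

# Hypothetical audit: 6 of 8 checks passed.
checks = [True, True, False, True, False, True, True, True]
aligned_pct, misaligned_pct = alignment_scores(checks)
print(f"aligned {aligned_pct}%, misaligned {misaligned_pct}%")
```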

Step Six: Identify patterns. Which pages have the worst misalignment? Which entity types are most distorted? Where are unexpected entities coming from?


What You Will Find

I can predict with high confidence what your first extraction audit will reveal.

Common finding one: Missing primary entities

Your most important entity (your company name) may have low salience. The machine does not recognize it as central because your content does not emphasize it structurally.

Common finding two: Unexpected category entities

You intend to be a “DeepTech infrastructure” company. The machine extracts “software development” or “IT services.” Generic categories dominate because your specific categories lack semantic density.

Common finding three: Relationship confusion

You intend your product to enable a specific capability. The extraction shows your product related to a different capability because your content uses ambiguous language.

Common finding four: Attribute absence

You intend your product to have specific attributes (speed, security, scale). The extraction shows no attributes because you never structured them as machine‑readable data.

Common finding five: Entity fragmentation

Your company name appears as three variations. The machine treats them as three separate entities. Your trust capital is split across phantom entities.

Every audit reveals these patterns. Every client is surprised.


Case Study: The Extraction Gap

A healthtech company ran their first extraction audit. They intended to be positioned as “AI‑powered clinical decision support for oncology.”

The extraction results:

  • Most salient entity: “software company” (salience 0.89)
  • Second: “healthcare” (salience 0.76)
  • Third: “data analytics” (salience 0.71)
  • “Oncology” appeared but with salience 0.12
  • “Clinical decision support” did not appear at all

The gap was devastating. The machine saw a generic health data software company. The founders saw an oncology AI pioneer.

We redesigned their semantic architecture. Six months later, re‑audit showed:

  • “Oncology” salience 0.78
  • “Clinical decision support” salience 0.65
  • “AI‑powered” salience 0.58

Discoverability for oncology‑specific queries tripled. Partnership conversations with cancer centers increased fivefold.

The technology had not changed. Only the semantic architecture had changed.


Interpreting Salience Scores

Salience (0 to 1) tells you how prominent an entity is in your content.

0.8 – 1.0: Very high salience. The entity is central. Machines will prioritize it heavily.

0.5 – 0.8: Moderate salience. The entity is important but not dominant.

0.2 – 0.5: Low salience. The entity appears but is not emphasized.

0.0 – 0.2: Very low salience. The entity is peripheral.

Your intended primary entities should have salience above 0.6. If they are lower, your content is not signaling their importance effectively.

Salience patterns to watch:

  • Flat salience: All entities have similar scores. No prioritization. Machines cannot tell what matters.
  • Wrong salience: An unintended entity has higher salience than your intended primary. Machines are prioritizing the wrong thing.
  • Missing salience: Your intended primary entity has very low salience. Machines may not recognize it as an entity at all.
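The bands and the three patterns can be checked mechanically. This sketch makes a few assumptions that are mine, not part of the audit method: a score on a band boundary goes to the higher band, "flat" means the spread between the highest and lowest score is under 0.1, and every entity other than the primary counts as unintended. The sample scores are the first-audit figures from the case study, with "oncology" as the intended primary.

```python
def band(salience: float) -> str:
    """Label a score; boundary values go to the higher band (an assumption)."""
    if salience >= 0.8:
        return "very high"
    if salience >= 0.5:
        return "moderate"
    if salience >= 0.2:
        return "low"
    return "very low"

def patterns(scores: dict[str, float], primary: str, flat_spread: float = 0.1) -> list[str]:
    """Return which of the flat / wrong / missing patterns apply."""
    found = []
    values = list(scores.values())
    if len(values) > 1 and max(values) - min(values) < flat_spread:
        found.append("flat")
    primary_score = scores.get(primary, 0.0)  # absent primary counts as 0
    if any(s > primary_score for name, s in scores.items() if name != primary):
        found.append("wrong")
    if primary_score < 0.2:
        found.append("missing")
    return found

first_audit = {
    "software company": 0.89,
    "healthcare": 0.76,
    "data analytics": 0.71,
    "oncology": 0.12,
}

labels = patterns(first_audit, primary="oncology")
print(labels, {name: band(s) for name, s in first_audit.items()})
```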

From Audit to Action

Your extraction audit is useless without remediation.

Remediation for missing primary entities:

  • Increase entity frequency (mention your company name more often, especially early in content)
  • Use entity prominence (put your name in titles, headers, bold text)
  • Add structured data that explicitly identifies your company as the primary entity

Remediation for unexpected categories:

  • Replace generic language with specific category terms
  • Use schema.org additionalType or knowsAbout to link your entity to the correct categories in external ontologies, and reserve sameAs for pages that identify the entity itself
  • Create content that explicitly associates your brand with your intended categories

Remediation for relationship confusion:

  • Write explicit relationship statements (“Our product enables X capability”)
  • Use schema.org relationship properties (hasPart, isRelatedTo, etc.)
  • Build a knowledge graph that maps relationships explicitly

Remediation for attribute absence:

  • Convert attributes to structured data (schema.org additionalProperty with PropertyValue, or custom properties)
  • Use consistent attribute language across all content
  • Include attributes in your narrative ledger

Remediation for entity fragmentation:

  • Choose canonical entity names and use them everywhere
  • Declare name variations with alternateName, and use sameAs to link the canonical entity to its profiles elsewhere
  • Update third‑party listings to use canonical names
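Several of these remediations end up in the same place: a block of JSON-LD on the page. The sketch below is one possible arrangement, not the only valid one. It uses the schema.org types `WebPage`, `Organization`, `Product`, and `PropertyValue` with the properties `mainEntity`, `mentions`, `alternateName`, `sameAs`, `brand`, and `additionalProperty`. The function name, company, product, URLs, and attribute values are all hypothetical.

```python
import json

def page_jsonld(canonical_name, variants, profile_urls, product_name, attributes):
    """Build JSON-LD naming the company as the page's main entity."""
    organization = {
        "@type": "Organization",
        "name": canonical_name,      # the one canonical name
        "alternateName": variants,   # known variations, declared explicitly
        "sameAs": profile_urls,      # pages that identify the same entity
    }
    product = {
        "@type": "Product",
        "name": product_name,
        "brand": {"@type": "Organization", "name": canonical_name},
        "additionalProperty": [
            {"@type": "PropertyValue", "name": key, "value": value}
            for key, value in attributes.items()
        ],
    }
    return {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "mainEntity": organization,
        "mentions": [product],
    }

# Hypothetical values, for illustration only.
data = page_jsonld(
    canonical_name="Acme Health",
    variants=["Acme", "AcmeHealth Inc."],
    profile_urls=["https://www.example.com/profiles/acme-health"],
    product_name="OncoAssist",
    attributes={"Median response time": "200 ms", "Deployment": "On-premises"},
)
print(json.dumps(data, indent=2))
```

The printed JSON would go inside a `<script type="application/ld+json">` element on the page. Run the result through a structured data validator before publishing.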

Running Extraction Audits Quarterly

One audit is not enough. Entity extraction behavior changes as your content changes and as extraction models evolve.

Quarterly audit cadence:

  • Baseline: Run full extraction audit on all key pages
  • Monthly: Run extraction on your homepage and top 3 landing pages (quick check)
  • Quarterly: Run full audit again, compare salience and entity sets to baseline

What to track over time:

  • Is the salience of your primary entities increasing?
  • Are unintended entities decreasing in salience?
  • Are new, aligned entities appearing?
  • Is entity fragmentation decreasing?
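Comparing a re-audit to the baseline is a small diff. This sketch assumes each audit has been reduced to a hypothetical mapping from entity name to average salience. It reports the change for entities present in both audits, plus the entities that appeared or disappeared.

```python
def salience_changes(baseline: dict[str, float], current: dict[str, float]) -> dict:
    """Diff two audits: per-entity change, new entities, dropped entities."""
    shared = baseline.keys() & current.keys()
    return {
        "delta": {n: round(current[n] - baseline[n], 2) for n in sorted(shared)},
        "new": sorted(current.keys() - baseline.keys()),
        "dropped": sorted(baseline.keys() - current.keys()),
    }

# Hypothetical audits, for illustration only.
baseline = {"Acme Health": 0.35, "IT services": 0.50}
current = {"Acme Health": 0.62, "IT services": 0.28, "oncology": 0.20}

changes = salience_changes(baseline, current)
print(changes)
```

A positive delta on a primary entity and a negative delta on an unintended one are the two movements you want to see together.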

Set targets:

  • Within 6 months: Primary entity salience above 0.6 on all key pages
  • Within 12 months: No unintended entity above 0.3 salience
  • Within 18 months: Entity fragmentation below 5% (less than 5% of references use non‑canonical names)
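The three targets are easy to turn into a pass/fail check. The sketch below computes fragmentation exactly as the target describes it, as the share of references that use a non-canonical name, and then tests each threshold. The function names, input shapes, and sample numbers are hypothetical.

```python
def fragmentation_pct(reference_counts: dict[str, int], canonical: str) -> float:
    """Percentage of name references that do not use the canonical name."""
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    non_canonical = total - reference_counts.get(canonical, 0)
    return round(100.0 * non_canonical / total, 1)

def check_targets(primary_by_page: dict[str, float],
                  unintended: dict[str, float],
                  frag_pct: float) -> dict[str, bool]:
    """Evaluate the 6, 12, and 18 month targets against current numbers."""
    return {
        "primary_above_0.6_on_all_pages": all(s > 0.6 for s in primary_by_page.values()),
        "no_unintended_above_0.3": all(s <= 0.3 for s in unintended.values()),
        "fragmentation_below_5pct": frag_pct < 5.0,
    }

# Hypothetical numbers, for illustration only.
counts = {"Acme Health": 46, "Acme": 3, "AcmeHealth Inc.": 1}
frag = fragmentation_pct(counts, canonical="Acme Health")
status = check_targets(
    primary_by_page={"/": 0.70, "/about": 0.55},
    unintended={"IT services": 0.25},
    frag_pct=frag,
)
print(frag, status)
```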

Tools for Continuous Extraction Monitoring

Manual audits are time‑consuming. As you mature, automate.

Google Natural Language API: Can be scripted to run weekly on key pages. Results stored in a spreadsheet or database.

Custom dashboards: Build a simple dashboard that queries the API daily and alerts when salience drops below thresholds.

Third‑party SEM tools: Some SEO and content intelligence platforms now include entity extraction and monitoring. Evaluate carefully, because many are still keyword‑focused rather than entity‑focused.

Open source: spaCy with custom entity recognition can be deployed internally. Requires engineering resources but gives full control.


The Business Case for Extraction Audits

Extraction audits are not academic. They predict business outcomes.

I have tracked extraction audit results against business performance across dozens of organizations.

Correlations:

  • Primary entity salience above 0.6 correlates with 2x higher retrieval rates
  • Entity fragmentation below 5% correlates with 30% shorter sales cycles
  • Alignment between intended and extracted categories correlates with 40% higher conversion rates

Organizations that run extraction audits quarterly improve their scores by an average of 15% per quarter. Organizations that never run audits see gradual degradation.

The cost of an audit (a few hours of tool time, a few hours of analysis) is trivial compared to the cost of invisibility.


Your First Extraction Audit This Week

You do not need permission. You do not need a budget.

Open Google Natural Language API (free tier). Copy your homepage text. Run extraction. Look at the results.

Ask: Is my primary entity the most salient? Are my intended categories present? Are there unexpected entities?

Share the results with your team. The conversation that follows will be more valuable than the audit itself.

Because seeing your brand the way AI sees it is uncomfortable. But discomfort is the beginning of improvement. See yourself as the machines see you. Then fix the gap.
