Most organizations have never run an entity extraction audit. They have no idea how AI sees them.
This post changes that.
What Entity Extraction Reveals
Entity extraction is not interpretation. It is identification.
When you run extraction on your website, the machine identifies:
- Named entities: Your company name, product names, founder names
- Category entities: Industry classifications, capability domains, market segments
- Relationship entities: How entities connect to each other
- Attribute entities: Characteristics, metrics, statuses
The extraction does not judge whether your claims are correct. It simply identifies what entities appear and how they relate.
The gap between your intended entities and the extracted entities is your semantic misalignment.
I have run hundreds of extraction audits. The gap is always larger than clients expect.
How to Run an Entity Extraction Audit
You need a tool. Several are available:
- Google Natural Language API: Good for general entity extraction, free tier available
- AWS Comprehend: Robust, handles custom entities, pay as you go
- spaCy: Open source, runs locally, requires some technical skill
- IBM Watson Natural Language Understanding: Enterprise features, higher cost
For a first audit, Google Natural Language API is sufficient and accessible.
Step-by-step:
Step One: Select content to audit. Start with your homepage, about page, product page, and two most popular blog posts. Add key landing pages. Total 10-20 pages.
Step Two: Run extraction on each page. For each, the tool will return a list of entities, their types (person, organization, product, etc.), and their salience scores (0 to 1, where 1 is most prominent).
Step Three: Aggregate results across pages. Create a spreadsheet with:
- Entity name
- Entity type (as extracted)
- Average salience (mean across pages)
- Pages where it appears
- Your intended entity type and relationship
Step Four: Compare extracted to intended. For each entity, answer:
- Does the extracted type match your intended type?
- Is the entity as prominent as you expect?
- Are any entities missing entirely?
- Are any unexpected entities appearing?
Step Five: Calculate your misalignment score (as defined in earlier posts). This is the percentage of extracted entity characteristics that align with your intentions.
Step Six: Identify patterns. Which pages have the worst misalignment? Which entity types are most distorted? Where are unexpected entities coming from?
What You Will Find
I can predict with high confidence what your first extraction audit will reveal.
Common finding one: Missing primary entities
Your most important entity (your company name) may have low salience. The machine does not recognize it as central because your content does not emphasize it structurally.
Common finding two: Unexpected category entities
You intend to be a “DeepTech infrastructure” company. The machine extracts “software development” or “IT services.” Generic categories dominate because your specific categories lack semantic density.
Common finding three: Relationship confusion
You intend your product to enable a specific capability. The extraction shows your product related to a different capability because your content uses ambiguous language.
Common finding four: Attribute absence
You intend your product to have specific attributes (speed, security, scale). The extraction shows no attributes because you never structured them as machine‑readable data.
Common finding five: Entity fragmentation
Your company name appears as three variations. The machine treats them as three separate entities. Your trust capital is split across phantom entities.
Every audit reveals these patterns. Every client is surprised.
Case Study: The Extraction Gap
A healthtech company ran their first extraction audit. They intended to be positioned as “AI‑powered clinical decision support for oncology.”
The extraction results:
- Most salient entity: “software company” (salience 0.89)
- Second: “healthcare” (salience 0.76)
- Third: “data analytics” (salience 0.71)
- “Oncology” appeared but with salience 0.12
- “Clinical decision support” did not appear at all
The gap was devastating. The machine saw a generic health data software company. The founders saw an oncology AI pioneer.
We redesigned their semantic architecture. Six months later, re‑audit showed:
- “Oncology” salience 0.78
- “Clinical decision support” salience 0.65
- “AI‑powered” salience 0.58
Discoverability for oncology‑specific queries tripled. Partnership conversations with cancer centers increased fivefold.
The technology had not changed. Only the semantic architecture had changed.
Interpreting Salience Scores
Salience (0 to 1) tells you how prominent an entity is in your content.
0.8 – 1.0: Very high salience. The entity is central. Machines will prioritize it heavily.
0.5 – 0.8: Moderate salience. The entity is important but not dominant.
0.2 – 0.5: Low salience. The entity appears but is not emphasized.
0.0 – 0.2: Very low salience. The entity is peripheral.
Your intended primary entities should have salience above 0.6. If they are lower, your content is not signaling their importance effectively.
Salience patterns to watch:
- Flat salience: All entities have similar scores. No prioritization. Machines cannot tell what matters.
- Wrong salience: An unintended entity has higher salience than your intended primary. Machines are prioritizing the wrong thing.
- Missing salience: Your intended primary entity has very low salience. Machines may not recognize it as an entity at all.
From Audit to Action
Your extraction audit is useless without remediation.
Remediation for missing primary entities:
- Increase entity frequency (mention your company name more often, especially early in content)
- Use entity prominence (put your name in titles, headers, bold text)
- Add structured data that explicitly identifies your company as the primary entity
Remediation for unexpected categories:
- Replace generic language with specific category terms
- Use schema.org sameAs to link your entity to the correct categories in external ontologies
- Create content that explicitly associates your brand with your intended categories
Remediation for relationship confusion:
- Write explicit relationship statements (“Our product enables X capability”)
- Use schema.org relationship properties (hasPart, isCapabilityOf, etc.)
- Build a knowledge graph that maps relationships explicitly
Remediation for attribute absence:
- Convert attributes to structured data (schema.org Property or custom properties)
- Use consistent attribute language across all content
- Include attributes in your narrative ledger
Remediation for entity fragmentation:
- Choose canonical entity names and use them everywhere
- Use sameAs to link variations to the canonical entity
- Update third‑party listings to use canonical names
Running Extraction Audits Quarterly
One audit is not enough. Entity extraction behavior changes as your content changes and as extraction models evolve.
Quarterly audit cadence:
- Baseline: Run full extraction audit on all key pages
- Monthly: Run extraction on your homepage and top 3 landing pages (quick check)
- Quarterly: Run full audit again, compare salience and entity sets to baseline
What to track over time:
- Is the salience of your primary entities increasing?
- Are unintended entities decreasing in salience?
- Are new, aligned entities appearing?
- Is entity fragmentation decreasing?
Set targets:
- Within 6 months: Primary entity salience above 0.6 on all key pages
- Within 12 months: No unintended entity above 0.3 salience
- Within 18 months: Entity fragmentation below 5% (less than 5% of references use non‑canonical names)
Tools for Continuous Extraction Monitoring
Manual audits are time‑consuming. As you mature, automate.
Google Natural Language API: Can be scripted to run weekly on key pages. Results stored in a spreadsheet or database.
Custom dashboards: Build a simple dashboard that queries the API daily and alerts when salience drops below thresholds.
Third‑party SEM tools: Some SEO and content intelligence platforms now include entity extraction and monitoring. Evaluate carefully — many are still keyword‑focused, not entity‑focused.
Open source: spaCy with custom entity recognition can be deployed internally. Requires engineering resources but gives full control.
The Business Case for Extraction Audits
Extraction audits are not academic. They predict business outcomes.
I have tracked extraction audit results against business performance across dozens of organizations.
Correlations:
- Primary entity salience above 0.6 correlates with 2x higher retrieval rates
- Entity fragmentation below 5% correlates with 30% shorter sales cycles
- Alignment between intended and extracted categories correlates with 40% higher conversion rates
Organizations that run extraction audits quarterly improve their scores by an average of 15% per quarter. Organizations that never run audits see gradual degradation.
The cost of an audit (a few hours of tool time, a few hours of analysis) is trivial compared to the cost of invisibility.
Your First Extraction Audit This Week
You do not need permission. You do not need a budget.
Open Google Natural Language API (free tier). Copy your homepage text. Run extraction. Look at the results.
Ask: Is my primary entity the most salient? Are my intended categories present? Are there unexpected entities?
Share the results with your team. The conversation that follows will be more valuable than the audit itself.
Because seeing your brand the way AI sees it is uncomfortable. But discomfort is the beginning of improvement. See yourself as the machines see you. Then fix the gap.