A PHP library for extracting rich metadata from web pages using multiple strategies: oEmbed, OpenGraph, JSON-LD, Twitter Cards, and HTML fallbacks.
- Multi-Source Extraction: Pulls metadata from oEmbed, OpenGraph, JSON-LD, Twitter Cards, and standard HTML
- Priority-Based Merging: Higher-quality sources take precedence (oEmbed → OpenGraph → JSON-LD → Twitter → HTML)
- Flexible Input: Fetch URLs directly or process pre-fetched HTML content
- Typed Value Object: Returns a structured `Metadata` object with typed properties
- Known Provider Support: Fast-path for popular platforms (YouTube, Vimeo, Twitter/X, Instagram, etc.)
- oEmbed Discovery: Automatic discovery via HTML link tags
- Configurable Error Handling: Choose between exceptions or graceful degradation

Requirements:

- PHP 8.2 or higher
- ext-dom (included with PHP)
- ext-json (included with PHP)
- Guzzle 7.0+ (for HTTP requests)

Install with Composer:

    composer require dealnews/metadata

Extract metadata from a URL:

    use DealNews\Metadata\MetadataExtractor;

    $extractor = new MetadataExtractor();
    $metadata = $extractor->extract('https://example.com/article');

    echo $metadata->title;       // "Article Title"
    echo $metadata->description; // "Article description..."
    echo $metadata->image_url;   // "https://example.com/image.jpg"
    echo $metadata->author;      // "John Doe"

Process pre-fetched HTML instead of fetching a URL:

    $extractor = new MetadataExtractor();
    $html = '<html><head><title>My Page</title>...';
    $metadata = $extractor->extract($html, false); // false = not a URL

Pass options to the constructor to configure the extractor:

    $extractor = new MetadataExtractor([
        'throw_on_http_error' => true,        // Throw exceptions on HTTP failures
        'http_timeout'        => 15,          // Request timeout in seconds
        'user_agent'          => 'MyBot/1.0', // Custom user agent
    ]);

The `Metadata` object contains the following properties:

| Property | Type | Description |
|---|---|---|
| `title` | `?string` | Page title |
| `description` | `?string` | Page description |
| `url` | `?string` | Canonical URL |
| `image_url` | `?string` | Primary image URL |
| `image_width` | `?int` | Image width in pixels |
| `image_height` | `?int` | Image height in pixels |
| `type` | `?string` | Content type (article, video, etc.) |
| `site_name` | `?string` | Name of the website/publisher |
| `author` | `?string` | Author name |
| `published_time` | `?string` | Publication date/time (ISO 8601) |
| `modified_time` | `?string` | Last modified date/time (ISO 8601) |
| `oembed_html` | `?string` | Embedded HTML from oEmbed |
| `oembed_type` | `?string` | oEmbed type (video, photo, rich, link) |

All fields are nullable and will be `null` if not found.

The library runs extractors in this order and merges results:
- oEmbed (provider registry + discovery)
- OpenGraph (og:* meta tags)
- JSON-LD (schema.org structured data)
- Twitter Cards (twitter:* meta tags)
- HTML (title, meta description, canonical link)

Later extractors only fill fields that are still null; they never overwrite data from higher-priority sources.
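
The fill-only-if-null rule can be sketched in a few lines. This is a minimal illustration, not the library's internal code: the `merge_by_priority()` helper and the sample arrays are hypothetical, and plain arrays stand in for whatever structures the extractors actually return.

```php
<?php

/**
 * Merge metadata arrays in priority order, filling only fields not yet set.
 * Hypothetical helper for illustration; not part of the library's API.
 *
 * @param array<string, mixed> ...$sources Highest-priority source first.
 * @return array<string, mixed>
 */
function merge_by_priority(array ...$sources): array
{
    $merged = [];
    foreach ($sources as $source) {
        foreach ($source as $field => $value) {
            // Skip nulls, and keep any value a higher-priority source already set.
            if ($value !== null && !isset($merged[$field])) {
                $merged[$field] = $value;
            }
        }
    }
    return $merged;
}

// Sample data (made up): oEmbed first, then OpenGraph, then HTML.
$oembed    = ['title' => 'Video Title', 'author' => null];
$opengraph = ['title' => 'OG Title', 'description' => 'OG description'];
$html      = ['title' => 'Page Title', 'author' => 'John Doe'];

$result = merge_by_priority($oembed, $opengraph, $html);
// title comes from oEmbed, description from OpenGraph, author from HTML.
```

Passing the sources highest priority first keeps the rule simple: a field is written once and never replaced.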
The library includes built-in support for popular oEmbed providers:
- YouTube
- Vimeo
- Twitter/X
- TikTok
- SoundCloud
- Spotify
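
A provider fast-path can be pictured as a lookup from host name to oEmbed endpoint. The sketch below is an assumption about how such a registry might look, not the library's actual registry: the `OEMBED_PROVIDERS` constant and `find_oembed_request_url()` function are hypothetical names, and only two providers are listed for brevity.

```php
<?php

// Hypothetical registry: normalized host name => oEmbed endpoint.
const OEMBED_PROVIDERS = [
    'youtube.com' => 'https://www.youtube.com/oembed',
    'youtu.be'    => 'https://www.youtube.com/oembed',
    'vimeo.com'   => 'https://vimeo.com/api/oembed.json',
];

/**
 * Build the oEmbed request URL for a page, or return null when the
 * page's host is not in the registry.
 */
function find_oembed_request_url(string $page_url): ?string
{
    $host = parse_url($page_url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // Not a parseable absolute URL.
    }

    // Normalize: lowercase and drop a leading "www.".
    $host = strtolower($host);
    if (str_starts_with($host, 'www.')) {
        $host = substr($host, 4);
    }

    $endpoint = OEMBED_PROVIDERS[$host] ?? null;
    if ($endpoint === null) {
        return null;
    }

    // The oEmbed spec passes the target page in the "url" query parameter.
    $query = http_build_query(['url' => $page_url, 'format' => 'json'], '', '&');

    return $endpoint . '?' . $query;
}

$request_url = find_oembed_request_url('https://www.youtube.com/watch?v=abc123');
```

A host that is not in the registry yields `null`, which is the point at which discovery via link tags would take over.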

For sites not in the registry, the library automatically looks for oEmbed discovery links:

    <link rel="alternate" type="application/json+oembed" href="...">

Note: oEmbed endpoints may require API keys or have usage limits. Managing these is the caller's responsibility.

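
Discovery of this kind can be approximated with `DOMDocument` and `DOMXPath`. The following is a rough sketch under the assumption that only JSON discovery links matter; `discover_oembed_endpoint()` is a hypothetical name, not a function exposed by the library.

```php
<?php

/**
 * Return the href of the first JSON oEmbed discovery link in the HTML,
 * or null when none is present. Hypothetical helper for illustration.
 */
function discover_oembed_endpoint(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string.
    }

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $links = $xpath->query(
        '//link[@rel="alternate"][@type="application/json+oembed"][@href]'
    );
    if ($links === false || $links->length === 0) {
        return null;
    }

    $link = $links->item(0);

    return $link instanceof DOMElement ? $link->getAttribute('href') : null;
}

$page = '<html><head>'
    . '<link rel="alternate" type="application/json+oembed" '
    . 'href="https://example.com/oembed?url=https%3A%2F%2Fexample.com%2Fpost">'
    . '</head><body></body></html>';

$endpoint = discover_oembed_endpoint($page);
```

The returned URL is only a candidate; whether it needs an API key or is rate limited still has to be handled by the caller.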
By default, the library returns partial results on errors:

    $extractor = new MetadataExtractor();
    $metadata = $extractor->extract('https://nonexistent.example.com');
    // Returns empty Metadata object, no exception thrown

Enable exceptions for HTTP errors:

    $extractor = new MetadataExtractor(['throw_on_http_error' => true]);

    try {
        $metadata = $extractor->extract('https://nonexistent.example.com');
    } catch (\GuzzleHttp\Exception\GuzzleException $e) {
        // Handle HTTP error
    }

Image and canonical URLs are resolved against the base URL when possible:

    // HTML: <link rel="canonical" href="/page">
    // Base URL: https://example.com/other
    // Result: https://example.com/page

The library handles pages with multiple `<script type="application/ld+json">` blocks and `@graph` structures.
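
To make the JSON-LD case concrete, here is one way multiple script blocks and `@graph` containers could be flattened into a single list of nodes. It is a sketch under stated assumptions, not the library's parser: `collect_json_ld_nodes()` is a hypothetical name, and invalid JSON blocks are simply skipped.

```php
<?php

/**
 * Collect every JSON-LD node from all ld+json script blocks, expanding
 * "@graph" containers. Hypothetical helper for illustration.
 *
 * @return list<array<string, mixed>>
 */
function collect_json_ld_nodes(string $html): array
{
    if ($html === '') {
        return [];
    }

    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = [];
    $xpath = new DOMXPath($doc);

    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (!is_array($data)) {
            continue; // Skip blocks that are not valid JSON objects or arrays.
        }

        // A block may hold a single object or a top-level list of objects.
        $items = array_is_list($data) ? $data : [$data];

        foreach ($items as $item) {
            if (!is_array($item)) {
                continue;
            }
            if (isset($item['@graph']) && is_array($item['@graph'])) {
                // Expand the graph container into its member nodes.
                foreach ($item['@graph'] as $graph_node) {
                    if (is_array($graph_node)) {
                        $nodes[] = $graph_node;
                    }
                }
            } else {
                $nodes[] = $item;
            }
        }
    }

    return $nodes;
}

$page = '<html><head>'
    . '<script type="application/ld+json">{"@type":"Article","headline":"Hello"}</script>'
    . '<script type="application/ld+json">{"@graph":[{"@type":"WebSite","name":"Example"},'
    . '{"@type":"Person","name":"John Doe"}]}</script>'
    . '</head><body></body></html>';

$nodes = collect_json_ld_nodes($page);
```

Once flattened, the nodes can be scanned by `@type` to pick out fields such as headline or author.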

Some sites block requests without proper User-Agent headers. The library includes a default:

    Mozilla/5.0 (compatible; MetadataBot/1.0)

Customize if needed:

    $extractor = new MetadataExtractor([
        'user_agent' => 'MyCustomBot/2.0',
    ]);

The library uses PHP's `DOMDocument` for HTML parsing, which handles most encoding issues automatically via libxml.

Run the test suite:

    composer install
    ./vendor/bin/phpunit tests/

Generate an HTML coverage report:

    ./vendor/bin/phpunit tests/ --coverage-html coverage/

BSD-3-Clause. See the LICENSE file for details.

This is a DealNews internal library. For issues or questions, contact the development team.