A Node.js tool for auditing website health through sitemap analysis. It is designed for SEO audits: it identifies broken links and detects network errors, including blocked network requests, using Playwright for browser automation.
- 🔍 Sitemap Analysis: Extract and validate URLs from XML sitemaps
- 🚨 Error Detection: Identify HTTP status codes of 400 and above, plus network failures
- ⚡ Concurrent Processing: Semaphore-based request throttling
- 📊 JSON Reporting: Structured output for CI/CD integration
- 🌐 Cross-Platform Support: Works with Playwright
- 🔄 Auto-Scroll Simulation: Trigger dynamic content loading
- 🔧 Configurable Thresholds: Customize batch sizes and connection limits
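To illustrate what semaphore-based throttling generally looks like, here is a minimal sketch. It is not taken from the package source; `Semaphore` and `mapWithLimit` are hypothetical names used only for this example.

```javascript
// Minimal counting semaphore: at most `limit` tasks hold a permit at once.
// Illustrative only; the package's internal implementation may differ.
class Semaphore {
  constructor(limit) {
    this.available = limit;
    this.waiters = [];
  }

  async acquire() {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // No permit free: park until release() hands one over.
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // pass the permit directly to the next waiter
    } else {
      this.available += 1;
    }
  }
}

// Run `worker(item)` for every item, never exceeding `limit` in flight.
async function mapWithLimit(items, limit, worker) {
  const semaphore = new Semaphore(limit);
  return Promise.all(
    items.map(async (item) => {
      await semaphore.acquire();
      try {
        return await worker(item);
      } finally {
        semaphore.release(); // always free the permit, even on errors
      }
    })
  );
}
```

With a limit of 50, for example, a call such as `mapWithLimit(urls, 50, checkOne)` would keep at most 50 checks running at once, where `checkOne` stands in for whatever per-URL work is needed.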
npm install sitemap-audit
Peer Dependencies (install as needed):
npm install playwright
You can modify the configuration in index.js or pass values via environment variables.
Option | Default Value | Description |
---|---|---|
resultsFolder | "results" | Folder where JSON reports are saved. |
batchSize | 20 | Number of URLs processed at a time. |
maxConnections | 50 | Max concurrent HTTP requests. |
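As a rough sketch of how environment variables could override these defaults: the variable names `RESULTS_FOLDER`, `BATCH_SIZE`, and `MAX_CONNECTIONS` below are assumptions made for illustration, as are the helpers `positiveInt` and `resolveConfig`. Check index.js for the names the package actually reads.

```javascript
// Defaults taken from the options table.
const DEFAULTS = { resultsFolder: "results", batchSize: 20, maxConnections: 50 };

// Parse a positive integer, falling back when the value is missing or invalid.
function positiveInt(raw, fallback) {
  const value = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// NOTE: these environment variable names are hypothetical.
function resolveConfig(env = process.env) {
  return {
    resultsFolder: env.RESULTS_FOLDER || DEFAULTS.resultsFolder,
    batchSize: positiveInt(env.BATCH_SIZE, DEFAULTS.batchSize),
    maxConnections: positiveInt(env.MAX_CONNECTIONS, DEFAULTS.maxConnections),
  };
}
```

Falling back to the default on invalid input keeps a typo in a CI variable from turning into a batch size of `NaN`.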
To check for HTTP errors (status 400 and above) using Playwright, refer to the example below:
import SiteChecker from "sitemap-audit";
import { test, chromium } from "@playwright/test";

const checker = new SiteChecker();

test("Validate and monitor sitemap URLs", async () => {
  // 4,000,000 ms (about 67 minutes); raise the timeout only when checking more than 200 URLs
  test.setTimeout(4_000_000);

  const browser = await chromium.launch();
  const context = await browser.newContext();

  // Generate URLs from the sitemap.xml
  const urls = await checker.fetchAndSplitUrls(
    "https://example.com/sitemap.xml"
  );

  // Check URL statuses
  await checker.checkUrlStatus(urls);

  // Monitor network requests for the first 20 URLs
  await checker.checkAllNetworkRequests(context, urls.slice(0, 20));

  await browser.close();
});
Results are saved in results/non-200-responses.json and results/network-failures.json.
results/non-200-responses.json is saved in the following format:
[
{ "url": "https://example.com/about", "status": 404 },
{ "url": "https://example.com/safety", "status": 500 }
]
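Since the report is plain JSON, a CI step can consume it directly. The sketch below assumes the report has already been parsed (for example with `JSON.parse`) and uses hypothetical helper names, `summarizeByStatus` and `exitCodeFor`, that are not part of the package.

```javascript
// Count report entries per HTTP status code.
function summarizeByStatus(entries) {
  const counts = {};
  for (const { status } of entries) {
    counts[status] = (counts[status] ?? 0) + 1;
  }
  return counts;
}

// Suggested exit code for a CI step: 1 if any entry is a 4xx/5xx, else 0.
function exitCodeFor(entries) {
  return entries.some(({ status }) => status >= 400) ? 1 : 0;
}

// Sample data copied from the report format shown in this README.
const report = [
  { url: "https://example.com/about", status: 404 },
  { url: "https://example.com/safety", status: 500 },
];

console.log(summarizeByStatus(report)); // → { '404': 1, '500': 1 }
console.log(exitCodeFor(report)); // → 1
```

A pipeline could pass the second value to `process.exit` to fail the build whenever the audit finds an error status.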
results/network-failures.json is saved in the following format:
[
{
"url": "https://example.com/sites/default/files/downloadable_test_pack.pdf?",
"status": 403,
"resourceType": "fetch",
"initiatingPage": "https://example.com/test"
}
]
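Because each entry records its `initiatingPage`, the failures can be regrouped per page to see which pages load broken resources. This is one possible post-processing sketch; `groupByInitiatingPage` is a hypothetical helper, not a package export.

```javascript
// Group parsed network-failures.json entries by the page that triggered them.
function groupByInitiatingPage(failures) {
  const pages = new Map();
  for (const failure of failures) {
    const list = pages.get(failure.initiatingPage) ?? [];
    list.push({
      url: failure.url,
      status: failure.status,
      resourceType: failure.resourceType,
    });
    pages.set(failure.initiatingPage, list);
  }
  return pages;
}

// Sample entry copied from the report format shown in this README.
const failures = [
  {
    url: "https://example.com/sites/default/files/downloadable_test_pack.pdf?",
    status: 403,
    resourceType: "fetch",
    initiatingPage: "https://example.com/test",
  },
];

for (const [page, items] of groupByInitiatingPage(failures)) {
  console.log(`${page}: ${items.length} failed request(s)`);
}
```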
fetchAndSplitUrls(sitemapUrl: string): Promise<string[]>
- Fetches and parses sitemap XML
- Returns array of validated URLs
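For readers curious what "fetch, parse, validate" can involve, here is a simplified sketch that pulls `<loc>` values out of sitemap XML and keeps only well-formed http(s) URLs. It is an assumption about the general approach, not the package's code; `extractSitemapUrls` is a hypothetical name, and the regex does not handle CDATA sections or nested sitemap index files.

```javascript
// Pull every <loc> value out of a sitemap XML string and keep valid http(s) URLs.
function extractSitemapUrls(xml) {
  const urls = [];
  for (const match of xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)) {
    // Sitemaps escape "&" as "&amp;" inside URLs.
    const candidate = match[1].replace(/&amp;/g, "&");
    try {
      const parsed = new URL(candidate);
      if (parsed.protocol === "http:" || parsed.protocol === "https:") {
        urls.push(parsed.href);
      }
    } catch {
      // Not a parseable URL: skip it.
    }
  }
  return urls;
}
```

A production parser would normally use a real XML library; the regex keeps the example short.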
checkUrlStatus(urls: string[]): Promise<void>
- Checks HTTP status codes for URLs
- Saves results to non-200-responses.json
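Conceptually, a batched status check might look like the sketch below. It is not the package's implementation: `chunk`, `findFailingUrls`, and `fetchStatus` are hypothetical names, the status lookup is injected so the logic can be tested offline, and the `>= 400` threshold is an assumption based on the "400+" wording in the feature list.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// `getStatus(url)` must return a Promise of a numeric HTTP status.
// Batches run one after another; URLs inside a batch run in parallel.
async function findFailingUrls(urls, getStatus, batchSize = 20) {
  const failing = [];
  for (const batch of chunk(urls, batchSize)) {
    const statuses = await Promise.all(batch.map((url) => getStatus(url)));
    batch.forEach((url, i) => {
      if (statuses[i] >= 400) failing.push({ url, status: statuses[i] });
    });
  }
  return failing;
}

// One possible getStatus, built on Node's global fetch (Node 18+).
// fetch rejects on network errors, so real code would also need a catch.
const fetchStatus = async (url) =>
  (await fetch(url, { method: "HEAD" })).status;
```

Calling `findFailingUrls(urls, fetchStatus)` would then yield entries in the same `{ url, status }` shape as the report.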
checkAllNetworkRequests(context: BrowserContext, urls: string[]): Promise<void>
- Analyzes network requests during page loads
- Saves resource failures to network-failures.json
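Playwright exposes `response` and `requestfailed` events on a page, which is one way this kind of monitoring can be done. The sketch below is an assumption about the approach, not the package's code; `trackNetworkFailures` is a hypothetical helper, and the `errorText` field is an extra that does not appear in the report format documented in this README.

```javascript
// Attach listeners to a Playwright Page and collect failed sub-requests.
function trackNetworkFailures(page, initiatingPage) {
  const failures = [];

  // Responses that arrived but carry an error status.
  page.on("response", (response) => {
    if (response.status() >= 400) {
      failures.push({
        url: response.url(),
        status: response.status(),
        resourceType: response.request().resourceType(),
        initiatingPage,
      });
    }
  });

  // Requests that never got a response (blocked, aborted, DNS failure, ...).
  page.on("requestfailed", (request) => {
    failures.push({
      url: request.url(),
      status: null,
      errorText: request.failure()?.errorText ?? "unknown", // hypothetical extra field
      resourceType: request.resourceType(),
      initiatingPage,
    });
  });

  return failures;
}
```

The listeners need to be attached before `page.goto(url)` is called, so that requests fired during the initial load are captured.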
Common Issues:
Missing Dependencies: Ensure the required browser drivers are installed:
npm install playwright
Timeout Errors: Increase the test timeout for large sitemaps:
test.setTimeout(120000); // 2-minute timeout
Pull requests are welcome! Please follow these steps:
- Create a feature branch from main
- Include test coverage
- Update documentation
MIT © Vipin Cheruvallil
For detailed implementation examples and issue tracking, visit our GitHub Repository.