site-audit-seo


Web service and CLI tool for SEO site audits: crawls a site, runs Lighthouse on every page, and lets you view public reports in the browser. Results can also be output to the console, JSON, CSV, XLSX, or Google Drive.

Web report viewer: site-audit-seo-viewer.

Demo: site-audit-demo (animated demo)

Russian description below.

Usage without installation

Open https://viasite.github.io/site-audit-seo-viewer/.

Features:

  • Crawls the entire site, collects links to pages and documents
  • Does not follow links outside the scanned domain (configurable)
  • Analyses each page with Lighthouse (see below)
  • Analyses the main page text with Mozilla Readability and Yake
  • Finds pages with SSL mixed content
  • Scans an arbitrary list of URLs with --url-list (see the example after this list)
  • Set default report fields and filters
  • Scan presets
  • Documents with the extensions doc, docx, xls, xlsx, ppt, pptx, pdf, rar, zip are added to the list with depth 0
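For example, to scan a prepared list of URLs instead of crawling a site (a usage sketch; the list URL is a placeholder):

site-audit-seo --url-list -u https://example.com/url-list.txt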

Technical details:

  • Does not load images, css, js (configurable)
  • Each site is saved to a file named after its domain in ~/site-audit-seo/
  • Some URLs are ignored (preRequest in src/scrap-site.js)

XLSX features

  • The first row and the first column are frozen
  • Column width and automatic cell height are configured for easy viewing
  • URL, title, description and some other fields are limited in width
  • Title is right-aligned to reveal the common part
  • Validation of some columns (status, request time, description length)
  • Exports xlsx to Google Drive and prints the URL
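For example, to save the report as XLSX and publish it to Google Drive (both flags appear in the options list below; the URL is a placeholder):

site-audit-seo -u https://example.com --xlsx --gdrive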

Web viewer features:

  • Fixed table header and url column
  • Add/remove columns
  • Column presets
  • Field groups by category
  • Filter presets (e.g. h1_count != 1)
  • Color validation
  • Verbose page details (+ button)
  • Direct URL to the same report with selected fields, filters and sort
  • Stats for all scanned pages, validation summary
  • Persistent report URL when using --upload
  • Switch between recently uploaded reports
  • Rescan the current report

Fields list (18.08.2020):

  • url
  • mixed_content_url
  • canonical
  • is_canonical
  • previousUrl
  • depth
  • status
  • request_time
  • title
  • h1
  • page_date
  • description
  • keywords
  • og_title
  • og_image
  • schema_types
  • h1_count
  • h2_count
  • h3_count
  • h4_count
  • canonical_count
  • google_amp
  • images
  • images_without_alt
  • images_alt_empty
  • images_outer
  • links
  • links_inner
  • links_outer
  • text_ratio_percent
  • dom_size
  • html_size
  • lighthouse_scores_performance
  • lighthouse_scores_pwa
  • lighthouse_scores_accessibility
  • lighthouse_scores_best-practices
  • lighthouse_scores_seo
  • lighthouse_first-contentful-paint
  • lighthouse_speed-index
  • lighthouse_largest-contentful-paint
  • lighthouse_interactive
  • lighthouse_total-blocking-time
  • lighthouse_cumulative-layout-shift
  • and 150 more Lighthouse audits!

Install

Install with docker-compose

git clone https://github.com/viasite/site-audit-seo
cd site-audit-seo
git clone https://github.com/viasite/site-audit-seo-viewer data/front
docker-compose pull # skip the build step
docker-compose up -d

The service will be available at http://localhost:5302.

Default ports:
  • Backend: 5301
  • Frontend: 5302
  • Yake: 5303

You can change them in the .env file or in docker-compose.yml.
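To check that the containers started correctly, standard docker-compose commands can be used (not specific to this project):

docker-compose ps
docker-compose logs -f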

Install with NPM:

npm install -g site-audit-seo

For Linux users:

npm install -g site-audit-seo --unsafe-perm=true

After installing on Ubuntu, you may need to change the owner of the Chrome directory from root to user.

Run this (replace $USER with your username, or run it as your user rather than as root):

sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"

This fixes the error Invalid file descriptor to ICU data received.

Command line usage:

$ site-audit-seo --help
Usage: site-audit-seo -u https://example.com --upload

Options:
  -u --urls <urls>             Comma separated url list for scan
  -p, --preset <preset>        Table preset (minimal, seo, headers, parse, lighthouse, lighthouse-all) (default: "seo")
  -e, --exclude <fields>       Comma separated fields to exclude from results
  -d, --max-depth <depth>      Max scan depth (default: 10)
  -c, --concurrency <threads>  Threads number (default: by cpu cores)
  --lighthouse                 Appends base Lighthouse fields to preset
  --delay <ms>                 Delay between requests (default: 0)
  -f, --fields <json>          Field in format --field 'title=$("title").text()' (default: [])
  --no-skip-static             Scan static files
  --no-limit-domain            Scan not only current domain
  --docs-extensions            Comma-separated extensions that will be added to the table (default: doc,docx,xls,xlsx,ppt,pptx,pdf,rar,zip)
  --follow-xml-sitemap         Follow sitemap.xml (default: false)
  --ignore-robots-txt          Ignore disallowed in robots.txt (default: false)
  -m, --max-requests <num>     Limit max pages scan (default: 0)
  --no-headless                Show browser GUI while scan
  --no-remove-csv              Don't delete the csv after xlsx generation
  --out-dir <dir>              Output directory (default: ".")
  --csv <path>                 Skip scan, only convert csv to xlsx
  --xlsx                       Save as XLSX (default: false)
  --gdrive                     Publish sheet to google docs (default: false)
  --json                       Output results in JSON (default: false)
  --upload                     Upload JSON to public web (default: false)
  --no-color                   No console colors
  --lang <lang>                Language (en, ru, default: system language)
  --open-file                  Open file after scan (default: yes on Windows and MacOS)
  --no-open-file               Don't open file after scan
  --no-console-validate        Don't output validate messages in console
  -V, --version                output the version number
  -h, --help                   display help for command
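As an illustration, a scan limited to depth 3 and 500 pages that saves an XLSX file and uploads the JSON report (all flags are from the list above; the URL is a placeholder):

site-audit-seo -u https://example.com -d 3 -m 500 --xlsx --upload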

Custom fields

Linux/Mac:

site-audit-seo -d 1 -u https://example -f 'title=$("title").text()' -f 'h1=$("h1").text()'

Windows:

site-audit-seo -d 1 -u https://example -f title=$('title').text() -f h1=$('h1').text()

Remove fields from results

This will output the fields from the seo preset, excluding the canonical fields:

site-audit-seo -u https://example.com --exclude canonical,is_canonical

Lighthouse

Analyse each page with Lighthouse

site-audit-seo -u https://example.com --preset lighthouse

Analyse seo + Lighthouse

site-audit-seo -u https://example.com --lighthouse

Config file

You can copy .site-audit-seo.conf.js to your home directory and tune the options.
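For a global npm install, the copy might look like this (a sketch; the assumption is that .site-audit-seo.conf.js ships in the package root of the globally installed module):

cp "$(npm prefix -g)/lib/node_modules/site-audit-seo/.site-audit-seo.conf.js" ~/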

Send to InfluxDB

This is a beta feature. How to configure it:

  1. Add this to ~/.site-audit-seo.conf:
module.exports = {
  influxdb: {
    host: 'influxdb.host',
    port: 8086,
    database: 'telegraf',
    measurement: 'site_audit_seo', // optional
    username: 'user',
    password: 'password',
    maxSendCount: 5, // optional, default send part of pages
  }
};
  2. Use --influxdb-max-send in the terminal.

  3. Create a command to scan your URLs:

site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log
  4. Add the command to cron, for example, as shown below.
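A crontab entry that runs the scan nightly (the schedule and log path are illustrative):

0 3 * * * site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log 2>&1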

Plugins

  • Readability - main page text length, reading time
  • Yake - keywords extraction from main page text

See CONTRIBUTING.md for details about plugin development.

Install plugins:

cd data
npm install site-audit-seo-readability
npm install site-audit-seo-yake

Disable plugins:

You can pass an argument such as --disable-plugins readability,yake. Scanning is faster, but less data is extracted.
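For example (plugin names as installed above; the URL is a placeholder):

site-audit-seo -u https://example.com --disable-plugins readability,yake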

Credits

Based on headless-chrome-crawler (Puppeteer). Uses the forked version @popstas/headless-chrome-crawler.

Bugs

  1. Sometimes it writes identical pages to the csv. This happens in two cases:
     1.1. A redirect from another page to this one (solved by setting skipRequestedRedirect: true, hardcoded).
     1.2. A simultaneous request of the same page in parallel threads.
  2. Sometimes a number appears instead of the URL; this happens at the stage of converting csv to xlsx, cause unknown.

Free audit tool alternatives

Free data scrapers

  • Web Scraper - free extension for local use
  • Portia - self-hosted visual scraper builder, scrapy based
  • Crawlab - distributed web crawler admin platform, self-hosted with Docker
  • OutWit Hub - free edition, pro edition for $99
  • Octoparse - 10 000 records free
  • Parsers.me - 1 000 pages per run free
  • website-scraper - opensource, CLI, download site to local directory
  • website-scraper-puppeteer - same but puppeteer based
  • Gerapy - distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Russian

Scans one or several sites into csv and xlsx files.

Features:

  • Crawls the entire site, collects links to pages and documents
  • Summary of results after the scan
  • Documents with the extensions doc, docx, xls, xlsx, pdf, rar, zip are added to the list with depth 0
  • Finds pages with SSL mixed content
  • Each site is saved to a file named after its domain
  • Does not follow links outside the scanned domain (configurable)
  • Does not load images, css, js (configurable)
  • Some URLs are ignored (preRequest in src/scrap-site.js)
  • Each page can be run through Lighthouse (see below)
  • Scans an arbitrary list of URLs, --url-list

XLSX features:

  • The first row and the first column are frozen
  • Column width and automatic cell height are configured for easy viewing
  • URL, title, description and some other fields are limited in width
  • Title is right-aligned to reveal the common part
  • Validation of some columns (status, request time, description length)
  • Uploads xlsx to Google Drive and prints the link

Installation:

npm install -g site-audit-seo

If you are on Ubuntu

npm install -g site-audit-seo --unsafe-perm=true
npm run postinstall-puppeteer-fix

Or run this (replace $USER with your username, or run it as your user rather than as root):

sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"

Details of the error Invalid file descriptor to ICU data received.

Usage

site-audit-seo -u https://example.com --upload

Custom fields

You can pass additional fields like this:

site-audit-seo -d 1 -u https://example -f "title=$('title').text()" -f "h1=$('h1').text()"

Lighthouse

Run each page through Lighthouse

site-audit-seo -u https://example.com --preset lighthouse

Regular SEO audit + Lighthouse

site-audit-seo -u https://example.com --lighthouse

How to count content pages from the csv

  1. Open it in a text editor
  2. Count documents by searching for ",0"
  3. Exclude pagination pages by searching for "?"
  4. Subtract 1 (the header row)
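A rough shell equivalent of these steps (a sketch, assuming document rows contain ",0" for the depth value and pagination URLs contain "?"; the csv path is a placeholder):

# rows without pagination URLs (subtract 1 for the header row)
grep -cv '?' ~/site-audit-seo/example.com.csv
# document rows (depth 0)
grep -c ',0' ~/site-audit-seo/example.com.csv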

Bugs

  1. Sometimes it writes identical pages to the csv. This happens in two cases:
     1.1. A redirect from another page to this one (solved by setting skipRequestedRedirect: true, done).
     1.2. A simultaneous request of the same page in parallel threads.
  2. Sometimes a number appears instead of the URL; this happens at the stage of converting csv to xlsx, cause unknown.

TODO:
