Cleans and extracts metadata from a web page or other website resource.
Metadata extraction fields currently supported:
Name | Data Type |
---|---|
author | array (jsonb) |
canonical_url | string |
copyright | string |
date (publish date) | date |
description | text |
favicon | text |
image (primary/og image) | text |
jsonld (structured data) | object (jsonb) |
keywords | array (jsonb) |
lang | string |
locale | string |
origin | string |
publisher | string |
site_name | string |
tags | array (jsonb) |
title | string |
type | string |
truncated_text | text |
status | string |
videos | array (jsonb) |
links | array (jsonb) |
NPM:
$ npm install site-metadata-extractor --save
Yarn:
$ yarn add site-metadata-extractor
Feed in raw markup from a webpage to get the extracted metadata fields.
From an .html file:
import fs from "fs";
import path from "path";
import siteMetadataExtractor from "site-metadata-extractor";

const getMetadataFromFile = (filename) => {
  // resolve the .html file relative to this module
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  const markup = fs.readFileSync(filepath).toString();
  // feel free to use localhost as the second parameter for testing
  const metadata = siteMetadataExtractor(markup, "YOUR_SITE_ORIGIN_HERE");
  return metadata;
};

getMetadataFromFile("example");
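The second argument above is a site origin. As a minimal sketch, assuming you start from a full page URL and want the origin part of it, the standard `URL` class can derive it. `getOrigin` is a hypothetical helper for illustration and is not part of site-metadata-extractor.

```javascript
// Hypothetical helper (not part of site-metadata-extractor):
// derive the origin (scheme + host + optional port) from a full page URL.
const getOrigin = (pageUrl) => {
  try {
    return new URL(pageUrl).origin;
  } catch (err) {
    // new URL() throws a TypeError for strings it cannot parse
    return null;
  }
};

const origin = getOrigin(
  "https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/"
);
console.log(origin);
```

The resulting string could then be passed wherever `"YOUR_SITE_ORIGIN_HERE"` appears in the example above.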
From a server request:
import axios from 'axios';
import siteMetadataExtractor from 'site-metadata-extractor';
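const processSite = async (url) => {
  return axios
    .get(url)
    .then((res) => {
      // the content-type header may be missing, so fall back to an empty string
      const contentType = res.headers['content-type'] || '';
      // only hand back the body for HTML responses
      if (contentType.includes('text/html')) {
        return {
          body: res.data,
          url
        };
      }
      return {};
    })
    .catch((err) => {
      console.log(err);
      // resolve to an empty object so callers always receive an object
      return {};
    });
};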
processSite('https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/')
  .then((data) => {
    ...
    // pass the raw markup (data.body), not the whole result object
    if (data && data.body) {
      siteMetadataExtractor(data.body, "https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/", "en");
    }
    ...
  });
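The content-type check in the request example can be pulled into a small guard that also tolerates a missing header. This is only a sketch: `isHtmlResponse` is a hypothetical helper, and it assumes the headers are exposed as an object keyed by lower-cased header names, as in the `headers['content-type']` lookup above.

```javascript
// Hypothetical helper (not part of site-metadata-extractor or axios):
// returns true only when the response headers declare an HTML body.
const isHtmlResponse = (headers) => {
  const contentType = headers ? headers["content-type"] : undefined;
  return (
    typeof contentType === "string" &&
    contentType.toLowerCase().includes("text/html")
  );
};

console.log(isHtmlResponse({ "content-type": "text/html; charset=utf-8" }));
console.log(isHtmlResponse({ "content-type": "application/json" }));
console.log(isHtmlResponse({}));
```

Checking the type first avoids calling `includes` on `undefined` when a server omits the header.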
- Clone the repository:
git clone https://github.com/sc10ntech/site-metadata-extractor.git
- Change into the project directory and install dependencies:
cd site-metadata-extractor && npm i
site-metadata-extractor was inspired by, and tries to be the spiritual successor to, node-unfluff.