In no event will we (Our Code World) or the developer of this module be liable for direct, indirect, special or other damages arising from the use of this module or the web pages downloaded with this module. Use it at your own risk.
And with our disclaimer, we are not saying that your computer will explode when using the module that we will use to copy a website. We just warn that this script should not be used for illegal activities (such as spoofing a website and exposing it on another web domain), but rather for learning more about Node.js and web development.
Having said that, have you ever seen an amazing website with some awesome gadget or widget that you want to have or learn how to build, but can't find an open source library that does it? That should be your first step: find an open source library that creates that amazing gadget and, if it exists, implement it in your own project. If you can't find one, you can use Chrome's developer tools to inspect the element for a cursory look at how it works and how you could create it yourself. However, if you are not so lucky, or you lack the skills to copy a feature through the developer tools, you still have a chance to do so.
What would be better than having the complete code that creates the awesome widget and being able to edit it however you want (something that will also help you understand how the widget works)? That is precisely what you are going to learn in this article: how to download a complete website via its URL with Node.js using a web scraper. Web scraping (also called screen scraping, web data extraction, web harvesting, etc.) is a technique for extracting large amounts of data from websites, where the data is extracted and saved to a local file on your computer or to a database in a table (spreadsheet) format.
Requirements
To download all the resources from a website, we will use the website-scraper module. This module allows you to download an entire website (or individual web pages) to a local directory (including all CSS resources, images, JavaScript, fonts, etc.).
Install the module in your project by executing the following command in the terminal:
npm install website-scraper
Note
Dynamic websites (where content is loaded via JavaScript) might not be saved properly, because website-scraper does not execute JavaScript; it only parses HTTP responses for HTML and CSS files. See the sketch below for a possible workaround.
Visit the official GitHub repository for more information.
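If you do need to capture a JavaScript-rendered page, one possible workaround (an assumption on our part, not something the examples in this article rely on) is the companion website-scraper-puppeteer plugin, which renders each page in headless Chromium before saving it. A minimal sketch, assuming the plugin is installed with npm install website-scraper website-scraper-puppeteer:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

scrape({
    urls: ['https://example.com/'],
    directory: './dynamic-site',
    // The plugin loads each page in a headless browser, so markup
    // generated by JavaScript ends up in the saved HTML files
    plugins: [new PuppeteerPlugin()]
}).then(() => console.log('Dynamic website downloaded'));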
1. Download a single page
The scrape function returns a Promise; it makes requests to all the provided URLs and saves all the files it finds to the given directory. The resources will be organized into folders according to resource type (css, images, or scripts) within the provided directory path. The following script will download the home page of the Node.js website:
const scrape = require('website-scraper');

let options = {
    urls: ['https://nodejs.org/'],
    directory: './node-homepage',
};

scrape(options).then((result) => {
    console.log("Website successfully downloaded");
}).catch((err) => {
    console.log("An error occurred", err);
});
Save the above script to a JavaScript file (e.g. script.js) and run it with node script.js. After the script finishes, the node-homepage folder will contain the downloaded resources, and opening the file index.html in a web browser will show a working copy of the page. All scripts and stylesheets were downloaded, and the website works flawlessly; the only error shown in the console is caused by a script loaded from Google, which you obviously have to remove from the code manually.
2. Download multiple pages
If you are downloading multiple pages from a website, you can provide them simultaneously in the same script. The scraper is smart enough to know that a resource should not be downloaded twice (as long as it was already downloaded from the same website on another page), so it will download all the markup files but skip the resources that already exist.
In this example, we are going to download 3 pages from the Node.js website (index, about, and blog), specified in the urls property. The content will be saved in the node-website folder (relative to where the script runs); if it doesn't exist, it will be created. To keep things organized, we will manually sort every kind of resource into its own folder (images, javascript, css, and fonts). The sources property is an array of objects that specifies the selectors and attribute values used to select which files to download.
This script is useful if you specifically want some web pages:
const scrape = require('website-scraper');

scrape({
    urls: [
        'https://nodejs.org/', // Will be saved with the default filename 'index.html'
        {
            url: 'http://nodejs.org/about',
            filename: 'about.html'
        },
        {
            url: 'http://blog.nodejs.org/',
            filename: 'blog.html'
        }
    ],
    directory: './node-website',
    subdirectories: [
        {
            directory: 'img',
            extensions: ['.jpg', '.png', '.svg']
        },
        {
            directory: 'js',
            extensions: ['.js']
        },
        {
            directory: 'css',
            extensions: ['.css']
        },
        {
            directory: 'fonts',
            extensions: ['.woff', '.ttf']
        }
    ],
    sources: [
        {
            selector: 'img',
            attr: 'src'
        },
        {
            selector: 'link[rel="stylesheet"]',
            attr: 'href'
        },
        {
            selector: 'script',
            attr: 'src'
        }
    ]
}).then(function (result) {
    // Prints the HTML of the downloaded pages
    // console.log(result);
    console.log("Content successfully downloaded");
}).catch(function (err) {
    console.log(err);
});
3. Recursive downloads
Imagine that you need not only specific web pages from a website, but all of its pages. One way to do this is to use the script above and manually specify every website URL you can find, but this can backfire: it will take a long time and you will probably miss some URLs. That is why website-scraper offers a recursive download feature that follows all the links on a page, then the links on those pages, and so on. Obviously that could lead to a very, very long (almost infinite) crawl, so you can limit it with the maximum allowed depth (the maxDepth property):
const scrape = require('website-scraper');

let options = {
    urls: ['https://nodejs.org/'],
    directory: './node-homepage',
    // Enable recursive download
    recursive: true,
    // Follow only the links from the first page (the index);
    // links found on other pages will not be followed
    maxDepth: 1
};

scrape(options).then((result) => {
    console.log("Web pages successfully downloaded");
}).catch((err) => {
    console.log("An error occurred", err);
});
The above script should download more pages this time.
Filter external URLs
As you would expect with any kind of website, there will be external URLs that do not belong to the website you want to copy. To prevent those pages from being downloaded as well, you can filter the URLs so that only the ones matching the website's domain are followed:
const scrape = require('website-scraper');
const websiteUrl = 'https://nodejs.org';

let options = {
    urls: [websiteUrl],
    directory: './node-homepage',
    // Enable recursive download
    recursive: true,
    // Follow only the links from the first page (the index);
    // links found on other pages will not be followed
    maxDepth: 1,
    urlFilter: function(url){
        // If the URL starts with the website's domain, continue:
        // e.g. https://nodejs.org matches https://nodejs.org/en/example.html
        if(url.indexOf(websiteUrl) === 0){
            console.log(`URL ${url} matches ${websiteUrl}`);
            return true;
        }
        return false;
    },
};

scrape(options).then((result) => {
    console.log("Web pages successfully downloaded");
}).catch((err) => {
    console.log("An error occurred", err);
});
That should decrease the number of downloaded pages in our example.
4. Download a complete website
Note
This task is time-consuming, so be patient.
If you want to download an entire website, you can use the recursive download feature and increase the maximum allowed depth to a reasonable number (in this example 50, which is not exactly reasonable, but whatever):
// Download every crawlable file from the website.
// Files are saved with the same structure as the website, using the `bySiteStructure` filename generator.
// The urlFilter discards links that point to other websites.
const scrape = require('website-scraper');
const websiteUrl = 'https://nodejs.org/';

scrape({
    urls: [websiteUrl],
    urlFilter: function (url) {
        return url.indexOf(websiteUrl) === 0;
    },
    recursive: true,
    maxDepth: 50,
    prettifyUrls: true,
    filenameGenerator: 'bySiteStructure',
    directory: './node-website'
}).then((data) => {
    console.log("Entire website successfully downloaded");
}).catch((err) => {
    console.log("An error occurred", err);
});
Final Recommendations
If the website's CSS or JS code is minified (and it probably will be), we recommend running it through a beautifier for each language (cssbeautify for CSS or js-beautify for JavaScript) to make the code more readable (it won't read exactly like the original source code, but it will be acceptable).
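For example, here is a minimal sketch, assuming js-beautify is installed (npm install js-beautify) and that the scraper saved a minified file under the hypothetical path ./node-homepage/js/main.js:

const fs = require('fs');
// js-beautify exposes one beautifier per language; .js is the JavaScript one
const beautify = require('js-beautify').js;

// Read a minified script previously downloaded by the scraper
const minified = fs.readFileSync('./node-homepage/js/main.js', 'utf8');

// Re-indent it so it becomes readable, then write it back
const readable = beautify(minified, { indent_size: 2 });
fs.writeFileSync('./node-homepage/js/main.js', readable);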
Have fun ❤️!