December 06, 2019
It’s 2019. You want to scrape a website for some data. What are your choices? Cheerio looks great! Let’s use that. It’s got a simple jQuery-like syntax for interacting with HTML. You write a quick script to download the HTML and run Cheerio on it. But wait, none of the data is there.
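That first attempt might look something like this. This is just a sketch of the failing approach; node-fetch is my own choice here, and the URL and selector are placeholders:

// A typical first attempt: download the raw HTML and query it with
// Cheerio's jQuery-like API. URL and selector are placeholders.
const fetch = require('node-fetch'); // npm install node-fetch cheerio
const cheerio = require('cheerio');

fetch('https://example.com')
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    // On a client-side rendered page this logs 0:
    // the data simply isn't in the downloaded HTML.
    console.log($('.video-feed-item-wrapper').length);
  });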
The page is written with React, and your data is added to the page dynamically. What now? Oh wait, there is PhantomJS, that’s a scriptable headless browser! You try to convince yourself it will work, even though there is a large sign on the official site stating that development is suspended.
You spend 20 minutes trying to install everything and add it to your PATH. Finally you just give up and use Homebrew to install it. Your test is now modified to work with PhantomJS. You pay no attention to the multiple JS files being pulled in but not executed. You run the PhantomJS command to screenshot the page. Lo and behold, all that JS was responsible for rendering in the data.
What now? Well, there is always Selenium, but that looks like bringing a nuke to a gunfight. All you need is something that can render the page, executing all the associated JavaScript, and pull some data using Node.js.
This is where Nightmare.js comes into its own: it renders the webpage using Electron and then allows interaction or scraping using simple, readable commands. Here is an example from the GitHub README:
const Nightmare = require('nightmare')
const nightmare = Nightmare({ show: true })

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'github nightmare')
  .click('#search_button_homepage')
  .wait('#r1-0 a.result__a')
  .evaluate(() => document.querySelector('#r1-0 a.result__a').href)
  .end()
  .then(console.log)
  .catch(error => {
    console.error('Search failed:', error)
  })
This script will navigate to DuckDuckGo and search for “github nightmare”. Awesome. Let’s look at a real-world example. For this to really work, let’s do something daft like scrape TikTok for videos, since it is built using React and their videos are funny.
We will be scraping the trending page for YNW Melly’s song 223s. Loading this page in Chrome yields roughly 20 videos. The videos use image posters; the videos themselves only kick in on user hover. Clicking through to each video does yield the video itself.
So what we will do is: scrape the trending page, get the path to each individual video, go to each video’s own page, grab the src path to the video, and then download it.
Let’s install the Nightmare library by running npm install nightmare. This will take a while, as it also installs Electron. Make a file called getLinks.js (or whatever) and let’s add Nightmare.
const Nightmare = require('nightmare')
const nightmare = Nightmare({ electronPath: require('./node_modules/electron'), show: false, });
I like to define the Electron path directly; the show option, when set to true, will render the Electron window so you can watch what it does.
Now, Nightmare.js is best used by waiting for an item on the page to render; for example, on Google you’d probably wait for the search element to be rendered before inputting data. On the TikTok page, each block holding an individual piece of content is a div with a class of video-feed-item-wrapper. So we will wait for this to render before we start scraping.
nightmare
  .goto('https://www.tiktok.com/music/Hate-Me-5000000001469522003?lang=en')
  .wait('.video-feed-item-wrapper')
Now we need to take that selector and grab the data. This is done via the .evaluate method, targeting the class we desire.
  .evaluate(selector => {
    return Array.from(document.querySelectorAll(selector))
      .map(element => element.href)
      .filter(el => el && el !== '');
  }, '.video-feed-item-wrapper')
Let’s go through this. The selector passed in is .video-feed-item-wrapper, which the .evaluate(selector => …) callback receives as its argument. Inside the page we execute document.querySelectorAll('.video-feed-item-wrapper'), which gives us all the content blocks. But we only need the URL, the href, so we convert the NodeList to an array, use map to get only the href of each element, filter out any empty results, and return the lot: a list of href links to each individual video page.
Once this operation is done we need to call .end() to close the Electron instance and release the memory. This is the full code:
const Nightmare = require('nightmare')
const nightmare = Nightmare({ electronPath: require('./node_modules/electron'), show: false, });
const fs = require('fs');

nightmare
  .goto('https://www.tiktok.com/music/Lucid-Dreams-6562966026735064079?lang=en')
  .wait('.video-feed-item-wrapper')
  .evaluate(selector => {
    // Runs inside the page: collect the href of every content block.
    return Array.from(document.querySelectorAll(selector))
      .map(element => element.href)
      .filter(el => el && el !== '');
  }, '.video-feed-item-wrapper')
  .end()
  .then(data => {
    // data is already the array of links returned from evaluate.
    const list = JSON.stringify(data, null, 1);
    fs.writeFile('json/links.json', list, (err) => {
      if (err) throw err;
      console.log('JSON saved!');
    });
  })
  .catch(error => {
    console.error('scraping failed:', error)
  })
We receive the data back in the .then, where I take all the links and just save out a JSON file called links.json, which will be formatted like so:
"https://www.tiktok.com/@dsasda/video/6754411501320228101",
"https://www.tiktok.com/@11.sda/video/6755158308815998213",
"https://www.tiktok.com/@sdasda/video/6754144576254053638",
"https://www.tiktok.com/@lod_dads/video/6754062758381243653",
Now you would think: why can’t I just chain a bunch of Nightmare actions to scrape each page? Well, each Nightmare instance is backed by an Electron window, and firing them all off at once just wouldn’t work. What we need to do is execute all our links step by step, one after another. Promise.all is tempting here, but it is designed for tasks that run concurrently; what we actually want is to fold the links into a single promise chain with reduce, so that each page visit only starts once the previous one has finished.
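To make the pattern concrete before we apply it to Nightmare, here is a minimal sketch of sequential promise chaining with reduce. The urls array and fakeTask function are stand-ins, not part of the real script:

// Fold an array of inputs into one promise chain so the async
// tasks run strictly one after another, collecting results.
const urls = ['a', 'b', 'c'];

function fakeTask(url) {
  // Stand-in for a Nightmare page visit.
  return new Promise(resolve => setTimeout(() => resolve(url.toUpperCase()), 100));
}

urls.reduce((chain, url) => {
  return chain.then(results =>
    fakeTask(url).then(result => {
      results.push(result);
      return results;
    })
  );
}, Promise.resolve([])).then(results => {
  console.log(results); // [ 'A', 'B', 'C' ], in order
});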
We will load our links.json, create a Nightmare instance, and then work through the list of links. Now, in my testing I had to check that I actually got proper TikTok URLs; usually they are formatted as domain/username/video/id. However, sometimes I would get an error result that was just the domain. We will want to filter those out, which I did by simply counting how many / characters are present and returning only those URLs that fit the pattern.
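As a quick illustration of that filter (the example links are made up):

// A full video URL splits on '/' into six parts:
// 'https:', '', domain, @username, 'video', id.
// A bare-domain error result splits into only four.
const links = [
  'https://www.tiktok.com/@someuser/video/6754411501320228101',
  'https://www.tiktok.com/',
];

const valid = links.filter(url => url.split('/').length > 4);
console.log(valid); // only the full video link survives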
Then we will need to visit each link inside a Nightmare instance, find the src of each video, return that value, and then write a JSON file out again. First let’s look at what we end up with:
let list = require('./json/links.json');
const Nightmare = require('nightmare')
const nightmare = Nightmare({ electronPath: require('./node_modules/electron'), show: false, });
const fs = require('fs');

list.filter((url) => {
  // Keep only full video URLs; error results are just the bare domain.
  return url.split('/').length > 4
}).reduce(function (accumulator, url) {
  // Chain the visits: each page only loads after the previous
  // promise in the chain has resolved.
  return accumulator.then(function (results) {
    return nightmare.goto(url)
      .wait('body')
      .evaluate(selector => {
        // Runs inside the page: grab the src of the first <video> element.
        return document.querySelectorAll(selector)[0].src
      }, 'video')
      .then(function (result) {
        results.push(result);
        return results;
      })
  });
}, Promise.resolve([])).then(function (results) {
  const json = JSON.stringify(results, null, 1);
  fs.writeFile('json/videos.json', json, (err) => {
    if (err) throw err;
    console.log('Videos saved!');
  });
  return nightmare.end();
}).catch(function (error) {
  console.error('scraping failed:', error)
})
Let’s talk through the reduce function and the use of the accumulator. reduce executes a function on each element of an array and boils it down to a single value. The accumulator accumulates the callback’s return values, which in our case is the growing array of src links to the videos. Because we seed the chain with Promise.resolve([]), the whole thing collapses into one promise that resolves with all of the values, which we then write to JSON.
Is this insane? Yes. But it works. I want to call special attention to return nightmare.end(), which ends the Nightmare instance when we are done; otherwise the process will never exit.
Now you’ll have a list of video links! But make sure you download them, because the videos seem to be access controlled and the links degrade over time.
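Downloading itself is not covered above, but a rough sketch with Node’s built-in https module might look like this. It assumes a videos/ directory already exists and that the links are plain, still-valid HTTPS URLs; in practice the CDN may also require extra headers or redirect handling:

// Stream each video URL from videos.json to disk.
const https = require('https');
const fs = require('fs');
const videos = require('./json/videos.json');

videos.forEach((url, i) => {
  const file = fs.createWriteStream(`videos/video-${i}.mp4`);
  https.get(url, response => {
    response.pipe(file);
    file.on('finish', () => file.close());
  });
});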
Daydream is a great Chrome extension for recording actions into a Nightmare script. I think it would be very useful for more complex interactions where, say, you have to log in to Facebook or something similar before scraping.
A special section for this: I like using Cheerio to navigate HTML and make selections, and it’s totally possible to combine the two. Here is what it would look like.
const cheerio = require('cheerio');

nightmare
  .goto(url)
  .wait('body')
  .evaluate(function () {
    return document.body.innerHTML;
  })
  .end()
  .then(function (body) {
    // Hand the rendered HTML to Cheerio for jQuery-style selection.
    const $ = cheerio.load(body);
    console.log($('title').text());
  })
However, I would probably save the HTML out once and then use Cheerio to navigate it in a separate pass. Otherwise you’d be starting Electron every time.
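That two-pass split might look something like this sketch; the file name page.html is my own choice, and the second pass is shown commented out since it would run as a separate script:

const Nightmare = require('nightmare');
const cheerio = require('cheerio');
const fs = require('fs');

// Pass 1: render the page once with Nightmare and save the HTML to disk.
Nightmare({ show: false })
  .goto('https://example.com') // placeholder URL
  .wait('body')
  .evaluate(() => document.body.innerHTML)
  .end()
  .then(body => fs.writeFileSync('page.html', body))
  .catch(error => console.error('render failed:', error));

// Pass 2, run as a separate script once page.html exists,
// with no Electron start-up cost:
// const $ = cheerio.load(fs.readFileSync('page.html', 'utf8'));
// console.log($('title').text());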
Hope that was helpful!
Written by Farhad Agzamov, who lives and works in London building things. You can follow him on Twitter and check out his GitHub here.