I'm genuinely excited to write this particular entry, for two reasons. One is that it's been some time since I last jumped into NodeJS, and the other is that this is my first attempt at writing something using Puppeteer. I needed to scrape information about products on a particular website. While exploring it in the Chrome Developer Tools, I noticed that for each value of the page parameter in the query string, the page made an API request for the granular product information I needed.
It was extremely difficult to work out how this URL was being constructed, and therefore what to request. But I quickly concluded that if the information showed up in the developer tools, it could somehow be extracted. I remembered this little tool existed, so I hopped to work!
So the premise seemed rather simple: intercept the requests the browser makes, the same ones you find in the Network tab. I knew how the start of the API path was formatted, just not the product IDs at the end, which were what I was scraping for. So I came up with the following initial function (after much research!)
Apologies, this is a stripped-down version of the finished result, but let me walk you through it...
We launch Puppeteer and enable request interception with setRequestInterception (I appreciate this needs more documentation). When we goto a page, every resource request the page makes fires our request event, and if the URL contains the string we've specified for the path, we capture it. In this example, I iterate through the pages until I reach one that does not request this particular API.
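To make the walkthrough above concrete, here is a minimal sketch of the interception loop. It is not the original function: the names makeRequestHandler and captureApiUrls, the '/api/products/' path fragment, the maxPages cap, and the choice of networkidle2 as the wait condition are all assumptions made for illustration.

```javascript
// Hypothetical fragment of the API path; the real value is not given in this post.
const API_PATH_FRAGMENT = '/api/products/';

// Builds a 'request' event handler that records every URL containing
// `fragment` and then lets the request continue, so the page loads normally.
function makeRequestHandler(fragment, captured) {
  return (request) => {
    const url = request.url();
    if (url.includes(fragment)) {
      captured.push(url);
    }
    request.continue();
  };
}

// Visits ?page=1, ?page=2, ... and stops at the first page that triggers
// no matching API request (or after maxPages, an assumed safety cap).
async function captureApiUrls(baseUrl, fragment = API_PATH_FRAGMENT, maxPages = 50) {
  const puppeteer = require('puppeteer'); // loaded lazily; npm install puppeteer
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setRequestInterception(true);

    const captured = [];
    page.on('request', makeRequestHandler(fragment, captured));

    for (let pageNumber = 1; pageNumber <= maxPages; pageNumber++) {
      const before = captured.length;
      const target = new URL(baseUrl);
      target.searchParams.set('page', String(pageNumber));
      await page.goto(target.toString(), { waitUntil: 'networkidle2' });
      if (captured.length === before) break; // this page made no matching request
    }
    return captured;
  } finally {
    await browser.close();
  }
}
```

Calling captureApiUrls with the listing URL of the site (for example a placeholder such as 'https://example.com/products') would resolve to the list of captured API URLs. Keeping the handler as a separate function means the matching logic can be checked with fake request objects, without launching a browser.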
Of course, the API endpoints I'm capturing here are then processed in a separate task, and their responses contain the JSON I'm after.
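That separate task isn't shown in this post, but as a rough sketch it could be as small as the following. I'm assuming here that the captured endpoints can be requested directly and return JSON, and that the global fetch from Node 18+ is available; harvestJson is a hypothetical name.

```javascript
// Requests each captured endpoint in turn and collects the parsed JSON.
// fetchImpl defaults to the global fetch (Node 18+); passing it in is an
// assumption that makes the function easy to exercise with a stub.
async function harvestJson(urls, fetchImpl = fetch) {
  const results = [];
  for (const url of urls) {
    const response = await fetchImpl(url);
    if (!response.ok) {
      // Skip endpoints that fail instead of aborting the whole run.
      console.warn(`Skipping ${url}: HTTP ${response.status}`);
      continue;
    }
    results.push({ url, data: await response.json() });
  }
  return results;
}
```

The requests run one after another on purpose, which keeps the load on the site low. Some APIs may also expect the cookies or headers the browser sent, in which case a plain request like this would need adjusting.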
As always, hope this inspires.