
Scraping Hidden Data

I'm genuinely excited to write this particular entry, for two reasons.  One is that it's been some time since I've jumped back into NodeJS; the other is that it is my first attempt at writing something using Puppeteer.  I needed to scrape information about products on a particular website.  While exploring it in the Chrome Developer Tools, I noticed that each time the page parameter in the query string changed, the site made an API request for the granular product information I needed.

It was extremely difficult to work out just how this URL was being constructed in order to know what to request, but I quickly came to the conclusion that if the information was in the developer tools, then it could somehow be extracted. I remembered this little tool existed, so I hopped to work!

So the premise seemed rather simple: intercept the requests the browser makes, the same ones you find in the network tab.  I knew how the start of the path was formatted, just not the product IDs at the end, which is what I was scraping for.  So I first came up with the following initial function (after much research!)
Apologies, this is a stripped-down version of a finished result, but let me walk you through it...

We are launching Puppeteer and enabling setRequestInterception (I appreciate this needs more documentation).  When we goto a page, every resource request the page makes fires our request event, and if the URL contains the string we've specified for the path, we capture it.  In this example, I iterate through the pages until I reach one that does not request this particular API.
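
For reference, here is a minimal sketch of that flow. It is not the original function: the listing URL, the `page` query parameter name, the API path fragment and the `maxPages` safety cap are all placeholder assumptions rather than values from the real site.

```javascript
// Hypothetical placeholders: swap in the real listing URL and API path fragment.
const LISTING_URL = 'https://example.com/products';
const API_PATH_FRAGMENT = '/api/products/';

// Build the listing URL for a page number (assumes a `page` query parameter).
function buildPageUrl(baseUrl, pageNumber) {
  const url = new URL(baseUrl);
  url.searchParams.set('page', String(pageNumber));
  return url.toString();
}

// True when a request URL contains the known start of the API path.
function isTargetRequest(requestUrl, fragment) {
  return requestUrl.includes(fragment);
}

// Walk the listing pages and collect every matching API request URL.
// Stops at the first page that triggers no matching request, or at maxPages.
async function collectApiUrls(baseUrl, fragment, maxPages = 50) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when scraping
  const browser = await puppeteer.launch();
  const captured = [];
  try {
    const page = await browser.newPage();
    await page.setRequestInterception(true);

    let seenOnThisPage = 0;
    page.on('request', (request) => {
      if (isTargetRequest(request.url(), fragment)) {
        captured.push(request.url());
        seenOnThisPage += 1;
      }
      // With interception enabled, every request must be continued
      // (or aborted), otherwise the page stalls.
      request.continue();
    });

    for (let n = 1; n <= maxPages; n += 1) {
      seenOnThisPage = 0;
      await page.goto(buildPageUrl(baseUrl, n), { waitUntil: 'networkidle2' });
      if (seenOnThisPage === 0) break; // this page never called the API
    }
  } finally {
    await browser.close();
  }
  return captured;
}
```

Calling `collectApiUrls(LISTING_URL, API_PATH_FRAGMENT)` would resolve to an array of the captured endpoint URLs. Waiting for `networkidle2` is one way to give the page time to fire its API call before moving on; a site that loads data later than that would need a different wait condition.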

Of course, the API endpoints I'm capturing here are then processed in a separate task, and their responses contain the JSON I am after.
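
That separate task could look something like the sketch below. It assumes the endpoints can be requested directly without the browser's cookies or headers, and that the global `fetch` from Node 18+ is available; neither is confirmed for the site in question, and the function names are illustrative only.

```javascript
// Drop duplicate URLs while keeping first-seen order.
function uniqueUrls(urls) {
  return [...new Set(urls)];
}

// Fetch each captured endpoint and parse its body as JSON.
// `fetchImpl` defaults to the global fetch (assumed: Node 18 or newer).
async function harvestJson(urls, fetchImpl = fetch) {
  const results = [];
  for (const url of uniqueUrls(urls)) {
    const response = await fetchImpl(url);
    if (!response.ok) {
      console.warn(`Skipping ${url}: HTTP ${response.status}`);
      continue;
    }
    results.push({ url, data: await response.json() });
  }
  return results;
}
```

Fetching sequentially keeps the load on the target site low; if the API rejects requests made outside the browser session, the same URLs would have to be loaded through Puppeteer instead.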

As always, hope this inspires.
