Writing a new scraper for every single ecommerce website is expensive and tedious. It doesn’t have to be that way. Keep reading to learn some techniques you can use to build a simple scraper that identifies patterns and collects price information from almost any ecommerce site.
To start, let’s look at a few product pages to uncover design patterns in the way prices are presented.
First, the price is shown as a currency amount rather than in words. Second, the price is the currency amount with the largest font size. Third, the price appears within the first 600 pixels of the page’s height. Finally, the price is typically shown above any other currency amounts on the page. There will always be some exceptions, and we’ll get into those later in this text. These observations, though, are enough to build a simple and effective scraper. Keep reading to learn how.
The Installation Process
The first thing to do is install Google Chrome. An alternative is Puppeteer, a Node.js library that controls Chrome programmatically, so you don’t need to run a GUI application to get the scraper running. It is a bit more complex to set up, though, which is why we’ll stick with Google Chrome in this tutorial.
Chrome Developer Tools
The code in this tutorial was kept as simple as possible, so it won’t be able to gather the prices from every single product page on the web. For now, open an Amazon product page or a Sephora item page, for instance, in Google Chrome. Right-click anywhere on the page, choose “Inspect” to open Chrome DevTools, and then click on the Console tab in DevTools.
Within the Console tab, you can enter any JavaScript code. The browser will run that code in the context of the webpage you have open.
Running the JavaScript Snippet
Begin by pasting the JavaScript snippet into the console. Then press “Enter”, and the price of the item should be printed to the console. If you don’t see it, you may have landed on a product page that is an exception to the observations above. This is totally okay; later on we’ll outline how to expand the script so that it works with many more item pages like these.
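A minimal sketch of such a snippet, built only from the observations above, could look like this. The function names, the record fields, and the dollar-only pattern are assumptions made for illustration; treat it as a starting point rather than a finished script.

```javascript
// Sketch of a price finder based on the design observations.
// All names here (elementToRecord, canBePrice, findPrice) are hypothetical.

// Assumption: prices look like "$24.99" or "$ 1,299" and fill the whole text.
const CURRENCY_PATTERN = /^\s*\$\s*\d[\d,]*(\.\d{1,2})?\s*$/;

// Turn a DOM element into a plain record with the properties we care about.
function elementToRecord(element) {
  const box = element.getBoundingClientRect();
  const style = window.getComputedStyle(element);
  return {
    text: element.textContent.trim(),
    y: box.top + window.scrollY,          // distance from the top of the page
    fontSize: parseFloat(style.fontSize), // "24px" becomes 24
  };
}

// A record can be a price if it is a currency amount within the first 600 px.
function canBePrice(record) {
  return CURRENCY_PATTERN.test(record.text) && record.y <= 600;
}

// Largest font first; ties go to the record nearest the top of the page.
function byLikelihood(a, b) {
  return b.fontSize - a.fontSize || a.y - b.y;
}

// Return the text of the most likely price, or null when nothing qualifies.
function findPrice(records) {
  const candidates = records.filter(canBePrice).sort(byLikelihood);
  return candidates.length > 0 ? candidates[0].text : null;
}

// Only touch the DOM when one exists, i.e. when pasted into the browser console.
if (typeof document !== 'undefined') {
  const records = Array.from(document.querySelectorAll('body *')).map(elementToRecord);
  console.log(findPrice(records));
}
```

Keeping the filtering and sorting separate from the DOM access means the same functions can be tried on hand-written records before running them against a real page.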
The Way That it Actually Works
First, you retrieve every HTML DOM element in the page. Each element is then converted into a plain JavaScript object (a record) that holds things like its XY position, text content, and font size. This is done by applying a conversion function to every element via the JavaScript map function.
Next, recall the observations made earlier about how a price is shown. With these in mind, you can go through the records and pick the ones that match the design observations. This is done with a function that checks whether a record is consistent with them. A regular expression is used to check whether the text is a currency amount; you can alter that regular expression if it doesn’t cover the web pages you are working with. You then filter the records so that only the potential price records remain.
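As an illustration of altering the regular expression, here is one possible broader pattern. The list of symbols and currency codes is an assumption, and the pattern only handles amounts written with a comma as the thousands separator and a dot as the decimal separator, so extend it for the sites you care about.

```javascript
// Hypothetical lists; add the symbols and codes your target sites use.
const SYMBOLS = '(?:US\\$|C\\$|\\$|€|£|₹|USD|EUR|GBP|INR)';
// An amount such as 1,299.00 or 1299 or 24.99.
const AMOUNT = '\\d{1,3}(?:,\\d{3})*(?:\\.\\d{1,2})?|\\d+(?:\\.\\d{1,2})?';

// The currency marker may come before the amount ("$24.99") or after it ("49.50 USD").
const CURRENCY_PATTERN = new RegExp(
  `^\\s*(?:${SYMBOLS}\\s?(?:${AMOUNT})|(?:${AMOUNT})\\s?${SYMBOLS})\\s*$`
);

// True when the whole text is a single currency amount.
function isCurrencyAmount(text) {
  return CURRENCY_PATTERN.test(text);
}

// Try the pattern on a few strings before using it on a real page.
const samples = ['$24.99', '€ 1,299.00', '49.50 USD', 'Add to cart', '24.99'];
for (const text of samples) {
  console.log(JSON.stringify(text), isCurrencyAmount(text));
}
```

A bare number such as 24.99 is deliberately rejected here, since without a currency marker it could just as well be a rating or a quantity.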
In the last step, you pick the price: the currency amount with the largest font size. If several currency amounts share the same largest font size, the price is most likely the one positioned highest on the page. The records can be sorted on those conditions with the JavaScript sort function. All that is left is to print the top record to the console and you should be good to go!
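The sorting step can be tried on its own with a few made-up records. The values below are hypothetical and stand in for records that already passed the price filter.

```javascript
// Hypothetical candidate records; y is the distance from the top of the page.
const candidates = [
  { text: '$4.99',  fontSize: 14, y: 320 }, // shipping fee
  { text: '$24.99', fontSize: 28, y: 410 }, // price repeated near the buy button
  { text: '$24.99', fontSize: 28, y: 180 }, // main price
  { text: '$19.99', fontSize: 16, y: 150 }, // related item
];

// Largest font size first; ties are broken by position, topmost first.
// The copy ([...candidates]) leaves the original array untouched.
const ranked = [...candidates].sort(
  (a, b) => b.fontSize - a.fontSize || a.y - b.y
);

console.log(ranked[0]); // the 28px record at y = 180
```

Note that the comparator returns a number, which is what the sort function expects; a comparator that returns true or false does not sort reliably.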
Taking it Up a Notch
One thing you could try is moving the script to a scalable program that doesn’t require a GUI. For instance, you can replace Google Chrome with Puppeteer, which drives the same browser but without the head, that is, without a visible window. It is one of the fastest options for headless web rendering, and it operates in essentially the same environment as Google Chrome. Once you’ve got Puppeteer set up, you can inject the script into the headless browser programmatically and have the price returned to a function within your program.
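A rough sketch of that setup is shown here, assuming Puppeteer has been installed with npm install puppeteer. The function names scrapePrice and findPriceInPage, the viewport size, the dollar-only pattern, and the example URL are all placeholders chosen for illustration.

```javascript
// Runs inside the page, so it may only use browser globals and must not
// reference anything from the surrounding Node.js scope.
function findPriceInPage() {
  const pattern = /^\s*\$\s*\d[\d,]*(\.\d{1,2})?\s*$/; // assumption: dollar prices
  const candidates = Array.from(document.querySelectorAll('body *'))
    .map((element) => ({
      text: element.textContent.trim(),
      y: element.getBoundingClientRect().top + window.scrollY,
      fontSize: parseFloat(window.getComputedStyle(element).fontSize),
    }))
    .filter((record) => pattern.test(record.text) && record.y <= 600)
    .sort((a, b) => b.fontSize - a.fontSize || a.y - b.y);
  return candidates.length > 0 ? candidates[0].text : null;
}

// Opens the URL in headless Chrome and resolves to the price text or null.
async function scrapePrice(url) {
  // Loaded here so that merely loading this file does not require Puppeteer.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url, { waitUntil: 'networkidle2' });
    // page.evaluate serializes the function and runs it in the page.
    return await page.evaluate(findPriceInPage);
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Example call; the URL is a placeholder:
// scrapePrice('https://example.com/product/123').then(console.log);
```

Because the value returned from the page is a plain string, it crosses back into the Node.js program without any extra conversion.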
Another thing you can do is improve the script itself. You’ll probably quickly notice that some item pages don’t work with this kind of script because they don’t follow the assumptions made about how the product price is shown. There is no perfect answer to this issue. However, you can generalize across more web pages and identify more patterns to make the scraper better. You can also improve it by:
- Adding more features, such as font weight or color.
- Checking whether the class names or element ids contain the word “price”, and looking for other words that tend to show up often.
- Ignoring struck-through currency amounts, since those are most likely the regular prices shown before a discount.
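Two of these ideas can be sketched as small helpers that work on the records. The record fields className, id, and textDecoration are assumptions; in the browser they could be filled from element.getAttribute('class'), element.id, and getComputedStyle(element).textDecorationLine.

```javascript
// Hypothetical record shape: { text, fontSize, y, className, id, textDecoration }.

// True when the class names or the id mention "price" (case-insensitive).
function mentionsPrice(record) {
  return /price/i.test(`${record.className} ${record.id}`);
}

// True when the text is rendered with a line through it.
function isStruckThrough(record) {
  return record.textDecoration.includes('line-through');
}

// Drop struck-through amounts, then move records that mention "price" to the front.
// The sort is stable, so any earlier ordering is kept within each group.
function refineCandidates(records) {
  return records
    .filter((record) => !isStruckThrough(record))
    .sort((a, b) => Number(mentionsPrice(b)) - Number(mentionsPrice(a)));
}

const records = [
  { text: '$39.99', fontSize: 20, y: 200, className: 'was', id: '', textDecoration: 'line-through' },
  { text: '$4.99', fontSize: 20, y: 210, className: 'shipping-fee', id: '', textDecoration: 'none' },
  { text: '$29.99', fontSize: 20, y: 220, className: 'product-Price', id: '', textDecoration: 'none' },
];

console.log(refineCandidates(records).map((record) => record.text));
```

One caveat: the computed text-decoration value is not inherited by child elements, so a real script may also need to check an element’s ancestors for a strikethrough.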
Another problem is that some pages follow a few of the design observations while violating others. The approach outlined above only accepts elements that satisfy every observation, so a single violation rules an element out. To deal with this, try a score-based system: give points to elements that follow an observation and take points away when they don’t. The price could then be the element, or elements, scoring above a certain threshold.
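One way such a scoring system might look is sketched here. The rules, the point values, the fixed 18px font-size cutoff, and the threshold are all assumptions that would need tuning against pages you have checked by hand.

```javascript
// Hypothetical rules: each observation carries a weight.
const RULES = [
  { points: 3, test: (r) => /^\s*\$\s*\d[\d,]*(\.\d{1,2})?\s*$/.test(r.text) },
  { points: 2, test: (r) => r.fontSize >= 18 },  // stand-in for "largest font"
  { points: 2, test: (r) => r.y <= 600 },        // within the first 600 pixels
  { points: 1, test: (r) => /price/i.test(r.className) },
];

// Add a rule's points when the record follows it, subtract them when it does not.
function score(record) {
  return RULES.reduce(
    (total, rule) => total + (rule.test(record) ? rule.points : -rule.points),
    0
  );
}

// Keep records scoring at or above the threshold, best first.
function likelyPrices(records, threshold = 4) {
  return records
    .map((record) => ({ ...record, score: score(record) }))
    .filter((record) => record.score >= threshold)
    .sort((a, b) => b.score - a.score);
}

const pageRecords = [
  { text: '$24.99', fontSize: 28, y: 700, className: 'product-price' },
  { text: '$4.99', fontSize: 12, y: 300, className: 'shipping' },
  { text: 'Add to cart', fontSize: 20, y: 400, className: 'btn' },
];

console.log(likelyPrices(pageRecords).map((record) => record.text));
```

In this made-up example the first record sits below 600 pixels, which a strict filter would reject, yet it still reaches the threshold because it follows the other observations.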
Finally, you could use Artificial Intelligence or Machine Learning techniques to identify and classify patterns on the remaining pages and automate the process even further. It is, however, a constantly evolving field, so your results may vary.