It's expensive and tedious to write a new scraper for every single ecommerce website. It doesn't have to be that way, though. Keep reading to learn some techniques you can use to build a simple scraper that detects common patterns and collects prices from almost any ecommerce site.
To start, let's take a look at some product pages to uncover design patterns in the way prices are presented.
Looking closely, a few patterns stand out:

- The price is always shown as a currency amount (for example, "$19.99"), never spelled out in words.
- The price is rendered in the largest font size of any currency amount on the page.
- The price appears near the top of the page, usually within the first 600 pixels or so.
- The price is typically positioned above any other currency amounts shown on the page.

There will always be exceptions, and we'll get into the nitty-gritty later on in this text. These observations alone, though, are enough to build a surprisingly effective scraper. Keep reading to learn how.
The Installation Process
The first thing you'll have to do is install Google Chrome. An alternative is Puppeteer, a programmable version of Chrome that doesn't need a GUI application to run. It's a bit more complex to set up, though, which is why we'll stick with regular Google Chrome in this tutorial.
Chrome Developer Tools
The code we've given you was kept as simple as possible, which means it won't be able to gather prices from every single product page on the web. For the sake of simplicity, open an Amazon or Sephora product page, for instance, in Google Chrome. Then right-click anywhere on the page, choose "Inspect" to open Chrome DevTools, and click on the Console tab inside DevTools.
How It Actually Works
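As a rough sketch of how such a console script might work, the snippet below walks every leaf element on the page, keeps the ones that look like currency amounts within the top 600 pixels, and picks the one with the largest font. The `pickPrice` name, the currency regex, and the exact cutoffs are illustrative assumptions, not the one true implementation:

```javascript
// A rough heuristic price finder, meant to be pasted into the DevTools
// Console on a product page. Assumption: the price is a currency amount,
// near the top of the page, in the largest font of any currency amount.
const CURRENCY_RE = /(?:[$€£¥]\s?\d[\d.,]*|\d[\d.,]*\s?(?:USD|EUR|GBP))/;

// Pure helper: picks the most price-like candidate from a list of
// { text, fontSize, top } objects, or returns null if nothing matches.
function pickPrice(candidates) {
  const priced = candidates.filter(
    c => CURRENCY_RE.test(c.text) && c.top < 600 // 600px cutoff is an assumption
  );
  if (priced.length === 0) return null;
  // Largest font wins; ties broken by vertical position (higher is better).
  priced.sort((a, b) => b.fontSize - a.fontSize || a.top - b.top);
  return priced[0].text.match(CURRENCY_RE)[0];
}

// Browser-only part: collect every leaf element's text, font size, and
// absolute vertical position, then run the heuristic over them.
if (typeof document !== 'undefined') {
  const candidates = [...document.querySelectorAll('body *')]
    .filter(el => el.children.length === 0) // leaf nodes only
    .map(el => ({
      text: el.textContent.trim(),
      fontSize: parseFloat(getComputedStyle(el).fontSize),
      top: el.getBoundingClientRect().top + window.scrollY,
    }));
  console.log('Detected price:', pickPrice(candidates));
}
```

Paste the whole snippet into the Console tab on a product page and it logs its best guess at the price. Because `pickPrice` is a pure function, the heuristic is also easy to unit-test outside the browser.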
Taking it Up a Notch
One thing you could try is moving this to a scalable program that doesn't require a GUI. For instance, you can replace Google Chrome with Puppeteer, which is essentially the same browser without the head: no visible window at all. It also happens to be one of the fastest headless web rendering options, and it operates in essentially the same ecosystem as Google Chrome. Once you have Puppeteer set up, you can inject the script into the headless browser programmatically and have the price returned to a function within your own program.
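A sketch of what that could look like, assuming Puppeteer is installed (`npm install puppeteer`) and reusing the same font-size and position heuristic described earlier; the `scrapePrice` name and the `networkidle2` wait are illustrative choices:

```javascript
// A sketch of the heuristic running headlessly via Puppeteer.
// Assumes `npm install puppeteer` has been run in the project.
async function scrapePrice(url) {
  const puppeteer = require('puppeteer'); // required lazily, inside the function
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // This callback runs inside the page, so it can use the DOM directly.
    return await page.evaluate(() => {
      const re = /[$€£¥]\s?\d[\d.,]*/;
      const priced = [...document.querySelectorAll('body *')]
        .filter(el => el.children.length === 0) // leaf nodes only
        .map(el => ({
          text: el.textContent.trim(),
          fontSize: parseFloat(getComputedStyle(el).fontSize),
          top: el.getBoundingClientRect().top + window.scrollY,
        }))
        .filter(c => re.test(c.text) && c.top < 600)
        .sort((a, b) => b.fontSize - a.fontSize || a.top - b.top);
      return priced.length ? priced[0].text.match(re)[0] : null;
    });
  } finally {
    await browser.close();
  }
}
```

From an async context, `await scrapePrice('https://example.com/some-product')` (a placeholder URL) launches headless Chrome, loads the page, and resolves with the best-guess price, or null if nothing matched.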
Another thing you can do is improve the script itself. You'll probably quickly find that some product pages don't work with it, because they don't follow the assumptions we made about how the product price is displayed. There is no single optimal answer to this problem, but by examining more web pages and identifying more patterns, you can keep making the scraper better. For example, you can:
- Add more features to the heuristic, such as font weight or color.
- Check whether element ids or class names contain the word "price", and look for other words that tend to show up often.
- Ignore struck-through currency amounts; they are most likely the old "regular" price rather than the current one.
You'll also run into pages that follow some of the design observations while violating others, whereas the pages we looked at earlier satisfy all of them. To handle this, try a score-based system: give an element points for each observation it follows and take points away for each one it violates, then treat the highest-scoring element above a certain threshold as the price.
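A score-based system like that could be sketched as follows, assuming candidate objects carrying the element's text, font size, vertical position, class name, id, and a struck-through flag; the point values and the threshold are arbitrary starting points to tune, and `scoreCandidate`/`pickByScore` are hypothetical names:

```javascript
// Award points for each observation a candidate follows, subtract points
// for violations. All weights below are arbitrary starting values.
function scoreCandidate(c) {
  let score = 0;
  if (/[$€£¥]\s?\d[\d.,]*/.test(c.text)) score += 2; // looks like a currency amount
  if (c.top < 600) score += 1;                       // near the top of the page
  if (c.fontSize >= 20) score += 1;                  // rendered in a large font
  if (/price/i.test((c.className || '') + ' ' + (c.id || ''))) score += 2; // "price" in class/id
  if (c.struckThrough) score -= 3;                   // struck-through: old price, not current
  return score;
}

// Treat the highest-scoring candidate above a threshold as the price.
function pickByScore(candidates, threshold = 3) {
  const scored = candidates
    .map(c => ({ c, score: scoreCandidate(c) }))
    .filter(s => s.score >= threshold)
    .sort((a, b) => b.score - a.score);
  return scored.length ? scored[0].c.text : null;
}
```

The nice property of this design is that a page violating one observation (say, a small font) can still be handled correctly, as long as the real price outscores everything else on the page.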
Finally, you could bring artificial intelligence or machine learning into the mix to handle the remaining pages, using pattern-recognition and classification techniques to automate the process even further. It's a constantly evolving area, though, so your results may vary.