
How to automate and scrape websites

The two methods of automating and scraping websites

When it comes to website automation, we have two options: HTTP requests and browser automation.

There are also desktop automation and browser extensions, but those are terrible and don't scale at all; they should only be used by non-programmers.

You can use either method in the language of your choice. For HTTP requests, practically every language has a library available; for browser automation you have two main choices: Selenium and Puppeteer.

HTTP Requests

HTTP is the protocol on which most of the web works. When you want to read a website, your browser sends a GET request to the server to retrieve the page and then displays it to you; when you want to log in to some website, the browser sends a POST request with your login info, and so on.

You can read more about HTTP requests here.

What you will do is use a programming language of your choice to send requests directly to the website to read or submit data.

Pros

  • Easy to spoof
  • Best performance

Cons

  • Somewhat harder to program
  • Can't execute JavaScript by default

A quick way to check whether a website can be automated with plain requests is to disable JavaScript in your browser and test the site: if the functionality you need still works, the site can be automated with requests.

To figure out the correct requests: if you simply want to read a webpage's contents, you just have to send a GET request to the page (see the sketch after these steps). For more complex actions, like logging in, do the following:

  • Open your browser's network tab; in Firefox the shortcut is Ctrl + Shift + E.
  • Perform the action you need, like logging in.
  • The request will appear in the tab, in this case usually a POST request.
  • Click the request and inspect its parameters.
  • Mimic the request in your program/script.
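
For the simple read-only case, here is a minimal sketch using the same request library as the login example further down; the URL and the user agent value are just placeholders:

  const request = require('request');

  const options = {
    url: 'https://example.com/',
    // Spoofing a browser user agent avoids the most basic blocks.
    headers: {'User-Agent': 'Insert a common browser user agent here'}
  };

  // A plain GET request returns the page's HTML in `body`, ready to be parsed.
  request.get(options, function(error, response, body) {
    if (error) {
      console.error(error);
      return;
    }
    console.log(response.statusCode);
    console.log(body);
  });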

Browser automation

Browser automation means using a programming language to control a real browser. There are two main options: Selenium, which has bindings for several languages such as Ruby, Python and Java; and Puppeteer, which is used from JavaScript. Selenium also supports several browsers, while Puppeteer supports only Chromium and its derivatives (which are many).

The most common choice these days is Puppeteer, due to its better performance and the popularity of JavaScript and Node.js. This also means that websites trying to prevent browser automation tend to target Puppeteer first; I have found many sites that detect Puppeteer but don't detect Selenium + Firefox.

Pros

  • Simpler and easier to program
  • Can easily execute JavaScript
  • Works on any site, given enough spoofing

Cons

  • Much more resource intensive
  • Harder to spoof

Browser automation should be your choice when the website in question doesn't work without JavaScript. Spoofing it is much harder, though, as JavaScript exposes lots of vectors for browser detection and fingerprinting.
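
As a small illustration, Puppeteer can at least replace the default headless user agent. This is a minimal sketch, and real spoofing usually requires much more (hiding navigator.webdriver, matching screen size and fonts, and so on):

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Replace the default "HeadlessChrome" user agent with a common one.
    await page.setUserAgent('Insert a common browser user agent here');

    await page.goto('https://example.com/');
    console.log(await page.title());

    await browser.close();
  })();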

Also, if you're not going to scale your project (e.g. run several bots at once), browser automation is a good choice, as the code is easier to write.

Examples

Login to reddit using HTTP requests with JavaScript:

  const request = require('request');

  const config = {
    // Spoof a regular browser; the default library user agent is easy to block.
    headers: {'User-Agent': 'Insert a common browser user agent here'},
    followAllRedirects: true,
    url: 'https://reddit.com/post/login',
    // Form fields captured from the network tab while logging in manually.
    form: {
      user: 'the username',
      passwd: 'the password'
    }
  };

  const callback = function(error, response, body) {
    if (error) {
      console.error(error);
      return;
    }
    console.log('Logged in!');
  };

  request.post(config, callback);

Login to reddit using browser automation with Ruby + Selenium (via the Watir gem):

  require 'watir'

  # Watir drives the browser through Selenium WebDriver.
  browser = Watir::Browser.new :firefox, headless: true
  browser.goto 'https://old.reddit.com/'

  # Fill in the login form and submit it.
  browser.text_field(name: 'user').set 'username'
  browser.text_field(name: 'passwd').set 'password'
  browser.checkbox(id: 'rem-login-main').set # "remember me"
  browser.button(text: 'login').click

  puts 'Logged in!'
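
For comparison, a rough sketch of the same login with JavaScript + Puppeteer; the CSS selectors for old reddit's login form (the #login_login-main form id in particular) are assumptions and may need adjusting:

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://old.reddit.com/');

    // The field names match the ones used above; the form id is an assumption.
    await page.type('#login_login-main input[name="user"]', 'username');
    await page.type('#login_login-main input[name="passwd"]', 'password');
    await page.click('#login_login-main button[type="submit"]');

    // Crude wait; a real script would wait for an element that only appears
    // once you are logged in.
    await new Promise(resolve => setTimeout(resolve, 3000));

    console.log('Logged in!');
    await browser.close();
  })();
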
24/01/2020