Case Study: Extract All Substack Article Titles and Links. Part A: Extract Individual Article Data

Using Selenium WebDriver to extract the title, publish date and link from the first Substack article.

Nov 30, 2024

This article series:

Part A: Extract Individual Article Data
Part B: Extract 25 articles on one page
Part C: Extract All
Part D: Publish
Part E: Annotation by Zhimin *
(Offers valuable tips for test automation engineers to level up their skills. Highly recommended! This article will be available exclusively on Substack.)

The Task

My father has been transferring his articles—and mine—from Medium to Substack. It’s been a significant effort, especially with updating links across hundreds of articles. He assigned me the task of extracting all published Substack articles (titles and links) to compile them into a single page.

It is easy to illustrate with images.

Substack only lists 25 articles per page (and we have 500+)

The result: one page
(check it out at: https://agileway.com.au/substack-articles)

Zhimin: This is an excellent automation scripting exercise. I encourage test automation engineers with a Substack account to give it a try.

Create an RSpec Test Script in TestWise, Execute It in TestWise Debugging Mode

Because Substack has two-factor authentication, it is not easy to run this automation script standalone. TestWise Debugging mode is super effective for this. For more, check out my father’s article below.

My Innovative Solution to Test Automation: Debug Automated E2E Web Test Scripts Intuitively and Proficiently

Run an empty test case to open “agileway.substack.com” in a Chrome browser.

Log in (manually)
Select any line (it does not matter which) and right-click to “Run Selected Scripts Against Current Browser”. This enters “TestWise Debugging mode”.

From now on, we work (edit, then run the steps) in the special debugging_spec.rb, against the Chrome browser that remains open.

Test Design (Rough)

Navigate to the “Published Posts” page.

By default, only the first 25 articles are shown (Substack uses pagination). Run the test step below (in TestWise debugging mode) to verify that.

  article_links = driver.find_elements(:xpath, "//a[contains(@href, '/publish/posts/detail')]")
  puts article_links.count #=> 25

This means we can keep navigating to the next page until we reach the end. We will worry about that later.

Extract the data from the First Article

First, try to extract the article title and its actual link. Start by finding the element for the first article.

article_links = driver.find_elements(:xpath, "//a[contains(@href, '/publish/posts/detail')]")
article_link_elem = article_links.first
# will use it later

Right-click one article and inspect it in Chrome.

The purpose of this is to analyze the HTML fragment, which is quite large.

The bad news is not the size. Rather, the actual article link is not present in the fragment. In other words, we need an extra step to get it. One method is to click the “View post” link.

article_link_elem.find_element(:tag_name, "button").click

This will open the article in another tab.

This is OK. Selenium WebDriver can handle multiple tabs well (unlike Cypress 😩).

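require 'date' # for Date.strptime

# switch to the last tab
driver.switch_to.window(driver.window_handles[-1])
begin
  sleep 0.5 # brief pause to let the article page render
  title = driver.find_element(:xpath, "//div[@class='post-header']/h1").text
  subtitle = driver.find_element(:xpath, "//h3[@class='subtitle']").text
  # the publish date is the last div under the grandparent of the second hover-card element
  elem = driver.find_elements(:class, "profile-hover-card-target")[1].find_elements(:xpath, "../../div").last
  publish_date = Date.strptime(elem.text, "%b %d, %Y")

  # format the date as YYYY-MM-DD so every field in the row is a plain string
  the_data = [title, subtitle, publish_date.strftime("%F"), driver.current_url]
  driver.close
ensure
  # always return focus to the first tab, even if a step above fails
  driver.switch_to.window(driver.window_handles.first)
end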

The above will extract the article data from the article page (on the second, i.e., the last tab). Once it is done, close the second tab, and switch the focus back to the first tab (to continue).
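The switch, work, close, switch-back sequence can also be wrapped in a small block helper, so the tab handling is written once. This is only a sketch: `in_last_tab` is a hypothetical name, not part of Selenium or TestWise, and it assumes the driver responds to `switch_to.window`, `window_handles` and `close`, as the Selenium Ruby driver does.

```ruby
# Hypothetical helper (not from the original script): runs the block with
# focus on the most recently opened tab, closes that tab on success, and
# always returns focus to the first tab.
def in_last_tab(driver)
  driver.switch_to.window(driver.window_handles.last)
  begin
    result = yield
    driver.close # closes the tab only when the block succeeded, as in the steps above
    result
  ensure
    # runs on success and on failure, so the first tab always regains focus
    driver.switch_to.window(driver.window_handles.first)
  end
end
```

With this in place, the extraction steps would sit inside `the_data = in_last_tab(driver) { ... }`. Note that, like the original steps, a failure inside the block leaves the second tab open.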

Save to a CSV file, for Debugging Purposes as well

During the development of an automation script, we usually don’t get it right on the first go. So, it is a good habit to produce output for verification.

In this case, I want to print out the extracted article title and links. One good way is to output to a text file.

  # the block form flushes and closes the file, so "tail -f" sees the line right away
  File.open("/Users/me/tmp.txt", "a") { |f| f.puts(the_data.inspect) }

Then run “tail -f /Users/me/tmp.txt” to check.

["Mac Mini M4 over M1 for Software Development Engineers", "A great machine for software development engineers, too", "2024-11-27", "https://agileway.substack.com/p/mac-mini-m4-over-m1-for-software"]
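The step above writes an inspected Ruby array to a text file, which is fine for eyeballing. If a real CSV file is wanted, Ruby's `csv` library can append the same row and take care of quoting titles that contain commas. This is a minimal sketch; the method name `append_article_row` and the file path in the usage line are made up for illustration.

```ruby
require 'csv'

# Hypothetical helper (not from the original script): appends one article
# row to a CSV file, creating the file if it does not exist yet.
def append_article_row(csv_path, row)
  # "a" opens in append mode; the block form closes the file afterwards
  CSV.open(csv_path, "a") { |csv| csv << row }
end
```

Usage, assuming `the_data` holds the four-element array shown above: `append_article_row("/Users/me/articles.csv", the_data)`.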

Extract to Method

The automation script works. It is a good idea to refactor it now, given that the steps for extracting a single article will be invoked many times (521 times, to be exact). The logical move is to extract them into a reusable method.

This can be done manually or using the “Extract to Method” refactoring in TestWise.

I also invoked another refactoring “Move to helper”, so that I could use it in the special debugging_spec.rb.
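As a rough idea of what the extracted method might look like, here is a sketch that wraps the page-reading steps from earlier in this article. The module name `ArticleExtractionHelper` and the method name `extract_article_data` are assumptions; the actual helper produced by the TestWise refactoring is not shown here and may differ. The method expects the driver to already be focused on the article tab.

```ruby
require 'date'

# Hypothetical helper module; name and signature are assumptions.
module ArticleExtractionHelper
  # Reads title, subtitle, publish date and URL from the article page the
  # driver is currently showing. Locators are the ones used in the inline
  # steps earlier in this article.
  def extract_article_data(driver)
    title = driver.find_element(:xpath, "//div[@class='post-header']/h1").text
    subtitle = driver.find_element(:xpath, "//h3[@class='subtitle']").text
    date_elem = driver.find_elements(:class, "profile-hover-card-target")[1]
                      .find_elements(:xpath, "../../div").last
    publish_date = Date.strptime(date_elem.text, "%b %d, %Y")
    # return plain strings so the row can be printed or saved as-is
    [title, subtitle, publish_date.strftime("%F"), driver.current_url]
  end
end
```

Once the module is included in the test helper, each article in the later loop needs only one call, `extract_article_data(driver)`, after switching to its tab.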

The Script (debugging_spec.rb)

I typically include a complete automation script in my articles. However, for this task, inspecting the Substack page source suggests that it may be intentionally designed to make automation more challenging. While automating it with Selenium WebDriver is feasible, my father advised against revealing locators in the script, as Substack might not like it.

# Special test that uses last browser window (from a TestWise run)
# Then you can try Selenium commands directly on the page, without the need to restart from the beginning.

load File.dirname(__FILE__) + "/../test_helper.rb"
require 'time'

describe "DEBUG" do
  include TestHelper

  before(:all) do
    use_current_browser
  end

  it "Debugging" do
    article_links = driver.find_elements(:xpath, "//a[contains(@href, '/publish/posts/detail')]")
    puts article_links.count #=> 25
    article_link_elem = article_links.first
    article_link_elem.find_element(:tag_name, "button").click

    # switch to the last tab
    driver.switch_to.window(driver.window_handles[-1])
    begin
      sleep 0.5
      title = driver.find_element(:xpath, "//div[@class='post-header']/h1").text
      subtitle = driver.find_element(:xpath, "//h3[@class='subtitle']").text
      elem = driver.find_elements(:class, "profile-hover-card-target")[1].find_elements(:xpath, "XXX").last
      publish_date = Date.strptime(elem.text, "%b %d, %Y")
      the_data = [title, subtitle, publish_date.strftime("%F"), driver.current_url]
      puts the_data
      driver.close
    ensure
      driver.switch_to.window(driver.window_handles.first)
    end
    
    # the block form flushes and closes the file after writing
    File.open("/Users/me/tmp.txt", "a") { |f| f.puts(the_data.inspect) }
  end
end
