Case Study: Extract All Substack Article Titles and Links. Part A: Extract Individual Article Data
Using Selenium WebDriver to extract the title, publish date, and link from the first Substack article.
This article series:
Part B: Extract 25 articles on one page (coming soon)
Part C: Extract All (coming soon)
Part D: Publish (coming soon)
Part E: Annotation by Zhimin (coming soon)
(offering valuable tips for test automation engineers to level up their skills. Highly recommended! This article will be exclusively available on Substack.)
The Task
My father has been transferring his articles—and mine—from Medium to Substack. It’s been a significant effort, especially with updating links across hundreds of articles. He assigned me the task of extracting all published Substack articles (titles and links) to compile them into a single page.
It is easy to illustrate with images.
Substack only lists 25 articles per page (and we have 500+)
The result: one page
(check it out at: https://agileway.com.au/substack-articles)
Zhimin: This is an excellent automation scripting exercise. I encourage test automation engineers with a Substack account to give it a try.
Create an RSpec Test Script in TestWise, Execute It in TestWise Debugging Mode
Because Substack has two-factor authentication, it is not easy to run this automation script standalone. TestWise Debugging mode is super effective for this. For more, check out my father’s article below.
Run an empty test case to open “agileway.substack.com” in a Chrome browser.
Log in (manually)
Select a line (does not matter) and right-click to “Run Selected Scripts Against Current Browser”. This enters “TestWise Debugging mode”.
From now on, we operate (edit, then run the steps) in the special debugging_spec.rb, against the Chrome browser that remains open.
Test Design (Rough)
Navigate to the “Published Posts” page.
By default, only the first 25 articles are shown (Substack uses pagination). Run the test step below (in TestWise debugging mode) to verify that.
article_links = driver.find_elements(:xpath, "//a[contains(@href, '/publish/posts/detail')]")
puts article_links.count #=> 25
This means we can keep navigating to the next page until we reach the end. We will worry about that later.
Extract the data from the First Article
First, try to extract the article title and its actual link. Start by finding the element for the first article.
article_links = driver.find_elements(:xpath, "//a[contains(@href, '/publish/posts/detail')]")
article_link_elem = article_links.first
# will use it later
Right-click one article and inspect it in Chrome.
The purpose of this is to analyze the HTML fragment, which is quite large.
The bad news is not the size; rather, the actual article link is not present. In other words, we need an extra step to get it. One method is to click the “View post” link.
article_link_elem.find_element(:tag_name, "button").click
This will open the article in another tab.
This is OK; Selenium WebDriver handles multiple tabs well (unlike Cypress 😩).
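The script that follows pauses for a fixed half second after switching tabs. As an alternative, a small polling helper can wait only as long as needed, for example until a second tab shows up. The sketch below is illustrative only: `wait_until` is a hypothetical helper, and the lambda stands in for `driver.window_handles` so the code runs without a browser. Selenium's Ruby bindings also provide `Selenium::WebDriver::Wait` for the same purpose.

```ruby
# Polls the block until it returns a truthy value or the timeout (in seconds) elapses.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout}s"
    end
    sleep interval
  end
end

# Stand-in for driver.window_handles (assumption): the second tab "appears" on the third poll.
polls = 0
handles = -> { (polls += 1) >= 3 ? ["list-tab", "article-tab"] : ["list-tab"] }

wait_until(timeout: 2, interval: 0.01) { handles.call.size > 1 }
puts polls  # 3
```

Against a real browser, the block would be something like `driver.window_handles.size > 1`.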
# switch to the last tab
driver.switch_to.window(driver.window_handles[-1])
begin
  sleep 0.5
  title = driver.find_element(:xpath, "//div[@class='post-header']/h1").text
  subtitle = driver.find_element(:xpath, "//h3[@class='subtitle']").text
  elem = driver.find_elements(:class, "profile-hover-card-target")[1].find_elements(:xpath, "../../div").last
  publish_date = Date.strptime(elem.text, "%b %d, %Y")
  # format the date as YYYY-MM-DD so it prints as a plain string
  the_data = [title, subtitle, publish_date.strftime("%F"), driver.current_url]
  driver.close
ensure
  # always return to the first tab, even if a step above fails
  driver.switch_to.window(driver.window_handles.first)
end
The above will extract the article data from the article page (on the second, i.e., the last tab). Once it is done, close the second tab, and switch the focus back to the first tab (to continue).
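The switch, work, close, and switch-back pattern can be wrapped in a block-taking helper. This is a minimal sketch, not the script used in this article: `in_last_tab` is a hypothetical name, and `FakeDriver` is a stand-in that mimics only `window_handles`, `switch_to.window` and `close`, so the example runs without a browser.

```ruby
# Runs the block in the most recently opened tab and closes that tab on success.
# The ensure clause switches back to the first tab even if the block raises.
def in_last_tab(driver)
  driver.switch_to.window(driver.window_handles.last)
  begin
    result = yield
    driver.close
    result
  ensure
    driver.switch_to.window(driver.window_handles.first)
  end
end

# Stand-in for a Selenium driver (assumption, for illustration only).
class FakeDriver
  attr_reader :window_handles, :current

  def initialize(handles)
    @window_handles = handles.dup
    @current = @window_handles.first
  end

  # Selenium exposes switch_to.window(handle); returning self keeps that chain working here.
  def switch_to
    self
  end

  def window(handle)
    @current = handle
  end

  def close
    @window_handles.delete(@current)
  end
end

driver = FakeDriver.new(["list-tab", "article-tab"])
seen = in_last_tab(driver) { driver.current }
puts seen            # article-tab
puts driver.current  # list-tab
```

With a real driver, the block would hold the `find_element` calls and return the extracted data.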
Save to a CSV file, for Debugging Purposes as well
During the development of an automation script, we usually don’t get it right on the first go. So, it is a good habit to produce output for verification.
In this case, I want to print out the extracted article title and links. One good way is to output to a text file.
File.open("/Users/me/tmp.txt", "a") { |f| f.puts(the_data.inspect) }
Then run “tail -f /Users/me/tmp.txt” to check.
["Mac Mini M4 over M1 for Software Development Engineers", "A great machine for software development engineers, too", "2024-11-27", "https://agileway.substack.com/p/mac-mini-m4-over-m1-for-software"]
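The step above writes the inspected array to a text file. If a real CSV file is wanted later, Ruby's `csv` library is one option, since it takes care of quoting fields that contain commas (such as the subtitle here). A minimal sketch, where `append_article_row` and the temporary file are hypothetical and used only for illustration:

```ruby
require "csv"
require "tmpdir"

# Appends one article row to a CSV file, creating the file if it does not exist.
def append_article_row(path, row)
  CSV.open(path, "a") { |csv| csv << row }
end

the_data = [
  "Mac Mini M4 over M1 for Software Development Engineers",
  "A great machine for software development engineers, too",
  "2024-11-27",
  "https://agileway.substack.com/p/mac-mini-m4-over-m1-for-software"
]

# A throwaway directory keeps the example from touching real files.
Dir.mktmpdir do |dir|
  path = File.join(dir, "articles.csv")
  append_article_row(path, the_data)
  puts File.read(path)
  # Reading the file back yields the same four fields.
  puts CSV.read(path).first == the_data
end
```

Opening in append mode mirrors the text-file approach, so `tail -f` still works on the CSV file.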
Extract to Method
The automation script works. It is a good idea to refactor it, since extracting a single article is going to be invoked many times (521 times, to be exact). It is quite logical to extract it to a reusable method.
This can be done manually or using the “Extract to Method” refactoring in TestWise.
I also invoked another refactoring, “Move to helper”, so that I could use it in the special debugging_spec.rb.
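The extracted method is not reproduced here. As a rough illustration of the idea, the part that does not touch the browser (turning the raw strings into one output row) might live in its own small method, which can then be checked without Substack. `build_article_row` is a hypothetical name, and "Nov 27, 2024" is an assumed example of the page's date text, inferred from the `"%b %d, %Y"` format used earlier.

```ruby
require "date"

# Turns the raw strings read from an article page into one output row.
# date_text is expected to look like "Nov 27, 2024" (assumption).
def build_article_row(title, subtitle, date_text, url)
  publish_date = Date.strptime(date_text.strip, "%b %d, %Y")
  [title.strip, subtitle.strip, publish_date.strftime("%F"), url]
end

row = build_article_row(
  "Mac Mini M4 over M1 for Software Development Engineers",
  "A great machine for software development engineers, too",
  "Nov 27, 2024",
  "https://agileway.substack.com/p/mac-mini-m4-over-m1-for-software"
)
puts row.inspect
```

The browser-facing method would then only collect the four strings and pass them in.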
The Script (debugging_spec.rb)
I typically include a complete automation script in my articles. However, for this task, inspecting the Substack page source suggests that it may be intentionally designed to make automation more challenging. While automating it with Selenium WebDriver is feasible, my father advised against revealing locators in the script, as Substack might not like it.
# Special test that uses last browser window (from a TestWise run)
# Then you can try Selenium commands directly on the page, without the need to restart from the beginning.
load File.dirname(__FILE__) + "/../test_helper.rb"
require 'time'
describe "DEBUG" do
  include TestHelper
  before(:all) do
    use_current_browser
  end
  it "Debugging" do
    article_links = driver.find_elements(:xpath, "//a[contains(@href, '/publish/posts/detail')]")
    puts article_links.count #=> 25
    article_link_elem = article_links.first
    article_link_elem.find_element(:tag_name, "button").click
    # switch to the last tab
    driver.switch_to.window(driver.window_handles[-1])
    begin
      sleep 0.5
      title = driver.find_element(:xpath, "//div[@class='post-header']/h1").text
      subtitle = driver.find_element(:xpath, "//h3[@class='subtitle']").text
      elem = driver.find_elements(:class, "profile-hover-card-target")[1].find_elements(:xpath, "XXX").last
      publish_date = Date.strptime(elem.text, "%b %d, %Y")
      the_data = [title, subtitle, publish_date.strftime("%F"), driver.current_url]
      puts the_data
      driver.close
    ensure
      # always return to the first tab, even if a step above fails
      driver.switch_to.window(driver.window_handles.first)
    end
    # append the extracted data; the block form closes the file afterwards
    File.open("/Users/me/tmp.txt", "a") { |f| f.puts(the_data.inspect) }
  end
end