Case Study: Extract All Substack Article Titles and Links. Part B: Extract 25 articles on one page
Add looping.
This article series:
Part E: Annotation by Zhimin Zhan *
(offering valuable tips for test automation engineers to level up their skills, exclusively available on Substack)
This continues from Part A. Having successfully extracted the title and link of a single article, the next step is to retrieve up to 25 articles from a single Substack list page.
Extract all 25 articles on one page
In the special `debugging_spec.rb` (still in TestWise Debugging mode), change the script to loop over all 25 articles.
article_links.each do |article_link_elem|
  the_data = extract_article_data(article_link_elem)
  # block form closes the file after each write
  File.open("/Users/me/tmp.csv", "a") { |f| f.puts(the_data.inspect) }
end
Please note that I used `a` (the append flag) when writing to the file, which let me view the data accumulated across multiple attempts.
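As a minimal sketch of what the append flag does, the snippet below writes two records in separate `File.open` calls and confirms that both survive. The helper name `append_record`, the temporary directory, and the sample records are assumptions made for illustration; they are not part of the original script.

```ruby
require "tmpdir"

# Hypothetical helper: append one record per line, closing the file each time.
def append_record(path, record)
  File.open(path, "a") { |f| f.puts(record.inspect) }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "tmp.csv") # stand-in for /Users/me/tmp.csv
  append_record(path, ["First attempt", "https://example.com/p/1"])
  append_record(path, ["Second attempt", "https://example.com/p/2"])
  puts File.readlines(path).size   # 2: the second write did not erase the first
end
```

With the `w` flag instead, each `File.open` call would truncate the file, so only the most recent record would remain.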
It was going OK for about 20 seconds.
Then, it failed.
Why? After inspecting the web page and the error stack trace shown in TestWise, I found that the script was unable to click the “View post” button.
Challenge: Scrolling
The reason: when the “View post” button is hidden behind the “Ask a question” layer, it is unclickable.
The solution is simple (and logical): add some scrolling after extracting each individual article.
article_links.each do |article_link_elem|
  the_data = extract_article_data(article_link_elem)
  driver.action.scroll_by(0, 100).perform
end
How did I come up with “100”? Just by experimenting. It so happens that `100` is a good number.
Get the Proper CSV
The above process was still in the experimentation phase. The generated file wasn't a proper CSV, which wasn't surprising: each line was Ruby's `inspect` output rather than a CSV row, so it appeared incorrect when opened in a spreadsheet due to an issue with the delimiter.
Creating a proper CSV is easy in Ruby.
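To illustrate the difference, the sketch below prints the same row twice: once with `inspect`, as the experimental loop did, and once with `Array#to_csv` from Ruby's standard `csv` library. The row values are made-up sample data, chosen to include a comma and embedded quotes.

```ruby
require "csv"

# Made-up sample row: [title, subtitle, published_on, link]
SAMPLE_ROW = [
  "Scrolling, clicking and waiting",
  "A \"quoted\" subtitle",
  "2024-01-01",
  "https://example.com/p/1"
].freeze

# Ruby array literal: square brackets, every string quoted, quotes backslash-escaped
puts SAMPLE_ROW.inspect

# CSV line: only fields that need it are quoted, embedded quotes are doubled
puts SAMPLE_ROW.to_csv
```

A spreadsheet has no notion of the surrounding brackets or backslash escapes, which is one plausible reason the first form does not split into clean columns.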
require "csv"

# story_data holds one [title, subtitle, published_on, link] row per article
csv_file = File.join(File.dirname(__FILE__), "..", "substack-published-articles.csv")
CSV.open(csv_file, "w") do |csv|
  csv << ["Title", "Subtitle", "Published On", "Link"] # header row
  story_data.each do |sd|
    csv << sd
  end
end
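The snippet above relies on `story_data`, which is not shown being built. One way it could be collected, sketched under assumptions: the loop gathers each article's row into an array while scrolling, and a second step writes the rows out. The method names `collect_story_data` and `write_story_csv` and the `scroll_step` parameter are hypothetical; the block passed to `collect_story_data` stands in for `extract_article_data`.

```ruby
require "csv"

# Runs the extraction loop and returns one row per article.
# `driver` is the Selenium driver already used in the script; the block is
# expected to return [title, subtitle, published_on, link] for an element.
def collect_story_data(driver, article_links, scroll_step: 100)
  article_links.map do |article_link_elem|
    row = yield(article_link_elem)
    # scroll after each article so the next button is not covered
    driver.action.scroll_by(0, scroll_step).perform
    row
  end
end

# Writes the header row followed by the collected rows.
def write_story_csv(csv_file, story_data)
  CSV.open(csv_file, "w") do |csv|
    csv << ["Title", "Subtitle", "Published On", "Link"]
    story_data.each { |sd| csv << sd }
  end
end
```

In the spec this might be used as `story_data = collect_story_data(driver, article_links) { |elem| extract_article_data(elem) }`, followed by `write_story_csv(csv_file, story_data)`. Keeping the scroll distance as a parameter makes it easy to experiment with values other than 100.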
Run the automation script again in TestWise.
Verify
After the execution of this automation script, the output `substack-published-articles.csv` contains the data for the 25 articles on the first page.
Looks good.
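Besides eyeballing the file, the check can be scripted. The following is a minimal sketch, assuming the header row used above and an expected count of 25; the method name `csv_looks_right?` is hypothetical.

```ruby
require "csv"

# Hypothetical check: true when the file has the expected header,
# exactly `expected_rows` data rows, and four fields in every row.
def csv_looks_right?(csv_file, expected_rows: 25)
  header, *data = CSV.read(csv_file)
  header == ["Title", "Subtitle", "Published On", "Link"] &&
    data.size == expected_rows &&
    data.all? { |row| row.size == header.size }
end
```

Calling `csv_looks_right?(csv_file)` after the run returns `false` if the loop stopped early or a row came out with the wrong number of fields.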