Case Study: Extract All Substack Article Titles and Links. Part D: Generate HTML and Publish
Generate HTML from extracted data and Invoke API to publish the content.
This article series:
Part E: Annotation by Zhimin *
(offering valuable tips for test automation engineers to level up their skills, exclusively available on Substack)
We now have over 500 articles data spread across 21 CSV files. We want to process all of them at once, so combine them into a single CSV file.
Aggregate CSVs
Run the following command (on Unix, macOS, or WSL on Windows) in the folder containing the generated CSV files (in a terminal).
% cat *.csv >> substack-published-articles-aggregated.csv
Create a Ruby script (shown below) to remove duplicate header rows, keeping only the first one.
found_head_row = false
lines = ["Title,Subtitle,Published On,Link"]
File.readlines('substack-published-articles-aggregated.csv', chomp: true).each do |line|
lines << line unless line == "Title,Subtitle,Published On,Link"
end
fio = File.open('substack-published-articles-1127.csv', "w")
lines.each {|line| fio.puts(line) }
fio.close
Run this ruby script in the terminal.
% ruby remove_duplciate_heading_rows.rb
Get the substack-published-articles-1127.csv
file.
Open it in Spreadsheet, it looked fine.
In Spreadsheet software, sort the first column (title). Then save the file, as CSV.
Generate HTML
The task is to generate a HTML from the CSV file. Ruby has built-in easy-to-use CSV parsing library.
require 'csv'
CSV.parse(File.read(file), headers: true, encoding: "UTF-8").each_with_index do |row, idx|
next if idx == 0
data << row
# access data by row["Title"]
end
Next, construct an HTML fragment as a string, in a simplest form. For this exercise, I did not use any HTML generation library, just pure string concatenation.
Still in TestWise, create one RSpec test script to does the generation.
Run the ‘test’ script. The generate HTML file looks like this.
There is no styling currently, which is fine, as it will be provided by the published site.
Upload to SiteWise via API
All my father’s websites on SiteWise CMS (content management system) that my father created. A placeholder for “Substack Article” page has already been created.
I just need to invoke SiteWise API to set the content for this page. It is really just a typical REST POST API testing, very easy to do in Ruby.
require "rest-client"
def post_sitewise_update(slug, content, parent_page_slug = "/")
url_params = "api_key=hidden123&page_full_path=#{slug}&parent_page_full_path=#{parent_page_slug}"
url = "https://agileway.com.au/api/update_page?" + url_params
resp = RestClient.post(url, content, headers = { "Content-Type" => "text/plain" })
puts resp.inspect # => if code is 2xx, means OK
puts resp.body
end
Create another test case (in the same script file) for publishing the article HTML.
Run it in TestWise. It failed with error stack trace.
<RestClient::Response 200 "{:error=>\"e...">
{:error=>"error occurred on update page: Mysql2::Error: Incorrect string value: '\\xF0\\x9F\\x91\\x8E\\xF0\\x9F...' for column 'body' at row 1: UPDATE `page_parts`
This is related to emoji Unicode.
html_published = html_published.gsub(/[^[:alnum:][:blank:][:punct:]]/, '').squeeze(' ').strip
Add the above statement to remove those emojis. Rerun the test.
The Result
The web page: https://agileway.com.au/substack-articles
Readers should find it easy to locate a specific article of interest.
Even my father (the main author) found it helpful. For instance, he quickly spotted a duplicate.
Related reading: