Case Study: Extract All Substack Article Titles and Links. Part D: Generate HTML and Publish

Generate HTML from extracted data and Invoke API to publish the content.

Dec 08, 2024

This article series:

Part A: Extract Article Data
Part B: Extract 25 articles on one page
Part C: Extract All
Part D: Generate HTML and Publish
Part E: Annotation by Zhimin *
(offering valuable tips for test automation engineers to level up their skills, exclusively available on Substack)

We now have over 500 articles data spread across 21 CSV files. We want to process all of them at once, so combine them into a single CSV file.

Aggregate CSVs

Run the following command (on Unix, macOS, or WSL on Windows) in the folder containing the generated CSV files (in a terminal).

% cat *.csv >> substack-published-articles-aggregated.csv

Create a Ruby script (shown below) to remove duplicate header rows, keeping only the first one.

found_head_row = false
lines = ["Title,Subtitle,Published On,Link"]
File.readlines('substack-published-articles-aggregated.csv', chomp: true).each do |line|
  lines << line unless line == "Title,Subtitle,Published On,Link"
end
fio = File.open('substack-published-articles-1127.csv', "w")
lines.each {|line| fio.puts(line) }
fio.close

Run this ruby script in the terminal.

 % ruby remove_duplciate_heading_rows.rb

Get the substack-published-articles-1127.csv file. Open it in Spreadsheet, it looked fine.

In Spreadsheet software, sort the first column (title). Then save the file, as CSV.

Generate HTML

The task is to generate a HTML from the CSV file. Ruby has built-in easy-to-use CSV parsing library.

require 'csv'
CSV.parse(File.read(file), headers: true, encoding: "UTF-8").each_with_index do |row, idx|
    next if idx == 0  
    data << row
    # access data by  row["Title"]
end

Next, construct an HTML fragment as a string, in a simplest form. For this exercise, I did not use any HTML generation library, just pure string concatenation.

Still in TestWise, create one RSpec test script to does the generation.

Run the ‘test’ script. The generate HTML file looks like this.

There is no styling currently, which is fine, as it will be provided by the published site.

Upload to SiteWise via API

All my father’s websites on SiteWise CMS (content management system) that my father created. A placeholder for “Substack Article” page has already been created.

I just need to invoke SiteWise API to set the content for this page. It is really just a typical REST POST API testing, very easy to do in Ruby.

require "rest-client"

  def post_sitewise_update(slug, content, parent_page_slug = "/")
    url_params = "api_key=hidden123&page_full_path=#{slug}&parent_page_full_path=#{parent_page_slug}"
    url = "https://agileway.com.au/api/update_page?" + url_params

    resp = RestClient.post(url, content, headers = { "Content-Type" => "text/plain" })
    puts resp.inspect # => if code is 2xx, means OK
    puts resp.body
  end

Create another test case (in the same script file) for publishing the article HTML.

Run it in TestWise. It failed with error stack trace.

<RestClient::Response 200 "{:error=>\"e..."> 
{:error=>"error occurred on update page: Mysql2::Error: Incorrect string value: '\\xF0\\x9F\\x91\\x8E\\xF0\\x9F...' for column 'body' at row 1: UPDATE `page_parts`

This is related to emoji Unicode.

html_published = html_published.gsub(/[^[:alnum:][:blank:][:punct:]]/, '').squeeze(' ').strip

Add the above statement to remove those emojis. Rerun the test.

The Result

The web page: https://agileway.com.au/substack-articles

Readers should find it easy to locate a specific article of interest.

Even my father (the main author) found it helpful. For instance, he quickly spotted a duplicate.

The Agile Way

Discussion about this post