Case Study: Automation Script to Extract the Top 10 Authors featured in the Software Testing Newsletters

An automation script to verify my father’s claim of being the “most featured author in the leading software testing newsletters”. The result: Yes.

Courtney Zhan and Zhimin Zhan

Feb 11, 2023


A repost of my daughter’s article, included in the How to do X in Selenium WebDriver? series.

In a recent article, my father guessed that he was “probably the most featured author in the leading software testing newsletters”. In this article, I will write an automation script to verify this claim by extracting author names from past issues (since 2021) of Software Testing Weekly and Coding Jag, both widely regarded as among the best software testing newsletters.

Table of Contents
· Analyse
∘ 1. To Extract the Author Names in the description of each article.
∘ 2. There can be more than one author within an article.
∘ 3. Narrow down the sections
∘ 4. Remove article links.
∘ 5. Filter out by exclusion words
· Execution
∘ Counting
∘ Put it all together: analyse all 98 issues over the past 2 years
∘ Charting
· My father’s featured count in Coding Jag
· Summary
· Full Test Script
· Zhimin’s Notes

Analyse

I start with one Software Testing Weekly issue (#153, the latest at the time).

1. To Extract the Author Names in the description of each article.

From Software Testing Weekly issue #153 (2023-01-29)

The author’s name is linked underneath the article title.

There are no identifiable attributes, e.g. <a class='author'>, on the author links. This means the locating strategy is to extract the paragraph links with the //p/a XPath or similar.

2. There can be more than one author within an article.

In the example above, there are three links: two are for authors (Zhimin Zhan and Malith Senadheera), and one is for a related article.

3. Narrow down the sections

The whole page structure is as below.

I started with this,

driver.find_elements(:xpath, "//div[@class='issue__body']//p/a")

to get all links under articles.

The results have too much noise, such as the sponsored and general links. I need to narrow it down to the relevant sections, defined in an array (in Ruby).

stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)

Use the script below to get the links in the specified sections, then combine them.

stw_categories.each do |category|
  section_links = driver.find_elements(:xpath,
    "//section[@class='category #{category}']//div/p/a")
  # ...
end
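To see exactly what the loop searches for, here is a minimal sketch that only builds the XPath strings, one per category, without touching a browser. The helper name `section_xpath` is made up for illustration and is not part of Selenium. The category list includes both `cc-toools` and `cc-tools`, as the test script in the Execution section does.

```ruby
# Category class names, as listed in the test script in the Execution section.
stw_categories = %w[cc-news cc-automation cc-toools cc-tools cc-books cc-videos]

# Hypothetical helper (not a Selenium API): XPath for the links in one section.
# Note that @class='...' is an exact string match on the whole class attribute.
def section_xpath(category)
  "//section[@class='category #{category}']//div/p/a"
end

stw_categories.each { |category| puts section_xpath(category) }
```

Because `@class='category cc-news'` compares the whole attribute value, a section carrying any extra class would not match. That is a limitation of this locator, assuming the page markup stays as described above.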

4. Remove article links.

I am only interested in author names, but the sections may also contain links to other articles. To filter those out, I made a crude assumption that an author's name is at most three words long. Everything else gets filtered out.

next if the_link_text.split.size > 3

The split method returns an array of words from a string.

5. Filter out by exclusion words

There may still be short article names or other links, so I define an exclusion list.

exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]

If the link text contains any of them, exclude it.

next if exclude_words.any?{|x| the_link_text.include?(x) }
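The two filters from steps 4 and 5 can be tried out on plain strings, without a browser. This is a minimal sketch: the helper name `author_like?` and the long article title are made up for illustration, while the exclusion list is the one defined above.

```ruby
exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]

# Hypothetical helper: true when a link text looks like an author name,
# i.e. at most three words and none of the exclusion phrases.
def author_like?(text, exclude_words)
  return false if text.split.size > 3
  exclude_words.none? { |phrase| text.include?(phrase) }
end

# Sample link texts; the long title is invented for this example.
link_texts = ["Zhimin Zhan", "this Reddit thread", "John Ferguson Smart",
              "A Made-Up Article Title With Many Words", "Test Model"]

authors = link_texts.select { |text| author_like?(text, exclude_words) }
puts authors.inspect
```

Only "Zhimin Zhan" and "John Ferguson Smart" survive both checks. An empty link text would also pass, so a guard such as `text.strip.empty?` may be worth adding if blank links turn up.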

Execution

Here is the test script for analysing one issue (#153).

it "Extract authors in Software Testing Weekly #153" do
    driver.get("https://softwaretestingweekly.com/issues/153")
    # Sections to scan; both "cc-toools" and "cc-tools" are listed
    stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)
    exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]
    category_links = []
    stw_categories.each do |category|
        section_links = driver.find_elements(:xpath, "//section[@class='category #{category}']//div/p/a")
        section_links.each do |one_link|
            the_link_text = one_link.text
            # Longer link texts are treated as article titles, not author names
            next if the_link_text.split.size > 3
            next if exclude_words.any? { |x| the_link_text.include?(x) }
            category_links << one_link
        end
    end
    author_names = category_links.collect { |elem| elem.text }
    puts author_names # one name per line
    puts "\n" + author_names.size.to_s + " in total"
end

Of course, I did not get the above in one go. I worked it out step by step in TestWise, using its wonderful “debugging mode”, which attaches test execution to the existing browser. There is no need to restart from the beginning to try out a new test step, which saves a lot of time and keeps the momentum going. So it did not take long, maybe 15 minutes (including analysis time), to get it done.

Running one test step in TestWise debugging mode.

The output:

Antoine Craske 
Ricardo Bedin 
Alan Richardson 
Daniel Lehner 
Maciej Rojek 
Martin Ivison 
Ioan Solderea 
John Ferguson Smart 
Elizabeth Zagroba 
Jeff Cechinel 
Paul de Witt 
Criss Chan 
Zhimin Zhan 
Lutfi Fitroh Hadi 
Dan Neciu 
Debojyoti Chatterjee 
Zhimin Zhan 
Malith Senadheera 
Nikola Dimic 
Mike Harris 
Jennifer Columbe 
John Miller

Issue #153 has 22 author links (21 distinct authors, since one name appears twice), which looks about right. Please note that 100% accuracy is not what I am aiming for, as I am only interested in the top authors.

Counting

In issue 153, my father’s name, “Zhimin Zhan”, appeared twice. We need to count authors by their number of occurrences. This is very easy to do in Ruby!

   puts author_names.tally

The output:

{"Antoine Craske"=>1, "Ricardo Bedin"=>1, "Alan Richardson"=>1, 
  "Daniel Lehner"=>1, "Maciej Rojek"=>1, "Martin Ivison"=>1, 
  "Ioan Solderea"=>1, "John Ferguson Smart"=>1, "Elizabeth Zagroba"=>1,
  "Jeff Cechinel"=>1, "Paul de Witt"=>1, "Criss Chan"=>1, 
  "Zhimin Zhan"=>2, "Lutfi Fitroh Hadi"=>1, "Dan Neciu"=>1, 
  "Debojyoti Chatterjee"=>1, "Malith Senadheera"=>1, "Nikola Dimic"=>1,
  "Mike Harris"=>1, "Jennifer Columbe"=>1, "John Miller"=>1}

To sort by occurrences.

sorted = author_names.tally.sort_by(&:last)

The output:

[["John Miller", 1],
 ["Ricardo Bedin", 1],
 ["Alan Richardson", 1],
 ["Daniel Lehner", 1],
 ["Maciej Rojek", 1],
 ["Martin Ivison", 1],
 ["Ioan Solderea", 1],
 ["John Ferguson Smart", 1],
 ["Elizabeth Zagroba", 1],
 ["Jeff Cechinel", 1],
 ["Paul de Witt", 1],
 ["Criss Chan", 1],
 ["Antoine Craske", 1],
 ["Lutfi Fitroh Hadi", 1],
 ["Dan Neciu", 1],
 ["Debojyoti Chatterjee", 1],
 ["Malith Senadheera", 1],
 ["Nikola Dimic", 1],
 ["Mike Harris", 1],
 ["Jennifer Columbe", 1],
 ["Zhimin Zhan", 2]]

To sort from high to low, reverse in place with sorted.reverse! to get:

[["Zhimin Zhan", 2],
 ["Jennifer Columbe", 1],
 ...
]

To get the top 10.

 top_10 = sorted[..9]
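Putting the counting steps together, here is a minimal, self-contained sketch on a shortened sample of the issue #153 names. The helper `top_authors` is a made-up name. Unlike the sort-then-reverse steps above, it breaks ties alphabetically, which is a small variation of mine so that the order is repeatable from run to run.

```ruby
# Shortened sample of the link texts extracted from issue #153.
author_names = ["Antoine Craske", "Zhimin Zhan", "Mike Harris",
                "Zhimin Zhan", "Malith Senadheera"]

# Hypothetical helper: the top n authors by count, highest first,
# with ties broken alphabetically by name.
def top_authors(names, n = 10)
  names.tally.sort_by { |name, count| [-count, name] }.first(n)
end

top_three = top_authors(author_names, 3)
top_three.each { |name, count| puts "#{name}: #{count}" }
```

Both `tally` and the beginless range in `sorted[..9]` need Ruby 2.7 or later.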

Zhimin: it is so easy and intuitive with Ruby, isn’t it?

Put it all together: analyse all 98 issues over the past 2 years

My father started blogging on January 27, 2021. The issue published around that time is #56, so I added a loop to analyse the 98 issues from #56 to #153.

author_names = []
(56..153).each do |issue_no|
  puts "Issue: #{issue_no}"
  driver.get("https://softwaretestingweekly.com/issues/#{issue_no}")

  # ... see above to extract one
  # ...
  author_names << the_link_text

  sleep 1 # don't hit the server too hard
end

Note: I added a sleep of 1 second in between loading each issue to prevent spamming the server too much.
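The shape of this loop can be sketched without a browser by passing in the per-issue extraction as a block. This is only an illustration: `collect_authors`, its `pause_seconds` parameter, and the stand-in data are made-up names, and in the real script the block would wrap the Selenium steps shown earlier.

```ruby
# Hypothetical helper: gathers author names across a range of issues.
# The block must return the author names found in one issue.
def collect_authors(issue_numbers, pause_seconds: 1)
  issue_numbers.flat_map do |issue_no|
    names = yield(issue_no)
    sleep pause_seconds # be polite to the server between page loads
    names
  end
end

# Stand-in data so the sketch runs offline (invented sample, not real results).
fake_issues = { 152 => ["Mike Harris"], 153 => ["Zhimin Zhan", "Zhimin Zhan"] }

all_names = collect_authors(152..153, pause_seconds: 0) do |issue_no|
  fake_issues.fetch(issue_no, [])
end
puts all_names.tally
```

Keeping the pause as a parameter lets the same loop run instantly against stand-in data while still waiting one second per page against the real site.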

The result:

[
 ["Dennis Martinez", 40], 
 ["Zhimin Zhan", 37], 
 ["Antoine Craske", 37], 
 ["Maaret Pyhäjärvi", 30], 
 ["Gleb Bahmutov", 28], 
 ["Pramod Dutta", 24], 
 ["Michael Bolton", 18], 
 ["Mike Harris", 18], 
 ["Callum Akehurst-Ryan", 16], 
 ["Gil Zilberfeld", 16]
]

So, my father ranks second (tied with Antoine Craske at 37), not first.

I quickly checked a few issues and found that a high percentage of the articles by “Dennis Martinez” and “Antoine Craske” are under the “News” category, which probably does not fit the scope of the claim. If I exclude that category, i.e. `stw_categories = %w(cc-automation cc-toools cc-tools cc-books cc-videos)`, the result would be:

[
  ["Zhimin Zhan", 32],
  ["Gleb Bahmutov", 28],
  ["Dennis Martinez", 27],
  ["Pramod Dutta", 24],
  ["Gil Zilberfeld", 14],
  ["Oleksandr Romanov", 12],
  ["Filip Hric", 12],
  ["Paul Grizzaffi", 11],
  ["Marie Drake", 11],
  ["NaveenKumar Namachivayam", 11]
]

On this measure, my father is at the top. Still, to be 100% neutral, I will go with the first result (including News, where my father ranked No. 2) for Software Testing Weekly.

Charting

My father’s featured count in Coding Jag

I also tried to create an automation script to do the same for another leading software testing newsletter: Coding Jag. However, author names are not shown in Coding Jag.

So, extracting all authors for comparison is not possible, but I can count the total number of my father’s articles featured there, from Issue 22 to 125 (the same period).

The script below is the main logic for counting unique articles whose links contain `zhiminzhan` (from his blog URL, https://zhiminzhan.medium.com). The full script is listed in a later section.

 links = driver.find_elements(:tag_name, "a")
 link_texts = links.collect { |x| x["href"] }
 zhimin_links = link_texts.compact.select { |y| y.include?("zhiminzhan") }.uniq
 zhimin_total_count += zhimin_links.count

The results:

Total number of articles by Zhimin Zhan on Coding Jag: 60
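The filtering step can be checked on plain strings. In this minimal sketch, the helper `matching_links` and the sample hrefs are made up for illustration. The `nil` entry mimics an `<a>` element without an `href`, which is why `compact` is needed.

```ruby
# Hypothetical helper: the distinct hrefs that contain the marker text.
def matching_links(hrefs, marker = "zhiminzhan")
  hrefs.compact.select { |href| href.include?(marker) }.uniq
end

# Invented hrefs standing in for the links of one Coding Jag issue.
issue_hrefs = [
  "https://zhiminzhan.medium.com/sample-article-one",
  "https://example.com/another-post",
  nil,
  "https://zhiminzhan.medium.com/sample-article-one",
  "https://zhiminzhan.medium.com/sample-article-two"
]

zhimin_total_count = 0
zhimin_total_count += matching_links(issue_hrefs).count
puts "Matches in this issue: #{zhimin_total_count}"
```

Assuming the count is added up once per issue, as the `+=` suggests, `uniq` only removes duplicates within a single issue. An article linked from two different issues would be counted twice unless all hrefs are gathered first and de-duplicated together.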

Summary

Coding Jag featured my father’s articles more often than Software Testing Weekly did: 60 versus 37, or 62% more for the same period.
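For reference, the 62% figure follows from the two counts reported earlier: 60 for Coding Jag and 37 for Software Testing Weekly (with News included). A quick check, with variable names of my own choosing:

```ruby
coding_jag_count = 60 # total reported above for Coding Jag
stw_count = 37        # Zhimin Zhan's count in Software Testing Weekly, News included

# Relative increase of the Coding Jag count over the Software Testing Weekly count.
percent_more = ((coding_jag_count - stw_count) / stw_count.to_f * 100).round
puts "#{percent_more}% more"
```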

With all the info above, my father’s claim is mostly correct.

Full Test Script
