Case Study: Automation Script to Extract the Top 10 Authors featured in the Software Testing Newsletters
To verify my father’s claim of “most featured author in the leading software testing newsletters”. The result: Yes.
A repost of my daughter’s article, included in the How to do X in Selenium WebDriver? series.
In a recent article, my father guessed that he was “probably the most featured author in the leading software testing newsletters”. In this article, I write an automation script to verify this claim by extracting author names from past issues (since 2021) of Software Testing Weekly and Coding Jag, both widely regarded as among the best software testing newsletters.
Table of Contents
· Analyse
∘ 1. To Extract the Author Names in the description of each article.
∘ 2. There can be more than one author within an article.
∘ 3. Narrow down the sections
∘ 4. Remove article links.
∘ 5. Filter out by exclusion words
· Execution
∘ Counting
∘ Put it all together: analyse all 98 issues over the past 2 years
∘ Charting
· My father’s featured count in Coding Jag
· Summary
· Full Test Script
· Zhimin’s Notes
Analyse
I start with one Software Testing Weekly issue (#153, the latest at the time).
1. To Extract the Author Names in the description of each article.
The author’s name is linked underneath the article title.
There are no identifiable attributes, e.g. <a class='author'>, for author links. This means the locating strategy is to extract the paragraph links with the //p/a XPath or similar.
2. There can be more than one author within an article.
For the above example,
There are three links here, two are for authors (Zhimin Zhan and Malith Senadheera), and one is for another related article.
3. Narrow down the sections
The whole page structure is as below.
I started with this,
driver.find_elements(:xpath, "//div[@class='issue__body']//p/a")
to get all links under articles.
The results have too much noise, such as the sponsored and general links. I need to narrow it down to the relevant sections, defined in an array (in Ruby).
stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)
Use the script below to get the links in the specified sections, then combine them.
stw_categories.each do |category|
  section_links = driver.find_elements(:xpath,
    "//section[@class='category #{category}']//div/p/a")
  # ...
end
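As a possible variation on the loop above (not the approach used in this article), the per-category XPaths could be joined with the XPath union operator `|` so that a single `find_elements` call returns all the links. The sketch below only builds the string; the name `combined_xpath` is made up for illustration, and the commented-out `driver` call assumes a live Selenium session.

```ruby
stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)

# Build one XPath per category, then join them with the XPath union operator.
combined_xpath = stw_categories
  .map { |category| "//section[@class='category #{category}']//div/p/a" }
  .join(" | ")

puts combined_xpath

# With a live browser session (an assumption here), this could replace the loop:
#   section_links = driver.find_elements(:xpath, combined_xpath)
```

The loop version is easier to debug one category at a time, which matters more than saving a few calls in a one-off script like this.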
4. Remove article links.
I am only interested in author names, but the paragraphs may also contain links to other articles. To filter those out, I made a crude assumption that an author's name is at most three words long. Everything else gets filtered out.
next if the_link_text.split.size > 3
The split method returns an array of words from a string.
5. Filter out by exclusion words
There may still be short article titles or other links, so I define an exclusion list.
exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]
If the link text contains any of them, exclude it.
next if exclude_words.any?{|x| the_link_text.include?(x) }
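The two filters (word count and exclusion list) can be tried out without a browser. Below is a minimal sketch on plain strings; the helper name `author_link?`, the `max_words` parameter and the sample texts are made up for illustration, and the empty-text check is an extra assumption that is not in the script above.

```ruby
# Hypothetical helper: true when a link text looks like an author name.
def author_link?(link_text, exclude_words, max_words: 3)
  return false if link_text.strip.empty?            # extra guard, an assumption
  return false if link_text.split.size > max_words  # too long to be a name
  exclude_words.none? { |phrase| link_text.include?(phrase) }
end

exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]

# Made-up sample link texts, for illustration only.
sample_texts = [
  "Zhimin Zhan",
  "John Ferguson Smart",
  "this Reddit thread",
  "Why record and playback tools fail"
]

authors = sample_texts.select { |text| author_link?(text, exclude_words) }
puts authors.inspect
# prints ["Zhimin Zhan", "John Ferguson Smart"]
```

The third sample is only three words long, so it passes the length check and is caught by the exclusion list instead.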
Execution
Here is the test script for analysing one issue (#153).
it "Extract authors in Software Testing Weekly #153" do
  driver.get("https://softwaretestingweekly.com/issues/153")
  stw_categories = %w(cc-news cc-automation cc-toools cc-tools cc-books cc-videos)
  exclude_words = ["this Reddit thread", "Test Model", "Architect of Quality"]
  category_links = []
  stw_categories.each do |category|
    section_links = driver.find_elements(:xpath, "//section[@class='category #{category}']//div/p/a")
    section_links.each do |one_link|
      the_link_text = one_link.text
      # skip links too long to be an author name
      next if the_link_text.split.size > 3
      # skip known non-author links
      next if exclude_words.any? { |x| the_link_text.include?(x) }
      category_links << one_link
    end
  end
  author_names = category_links.collect { |elem| elem.text }
  puts author_names
  puts "\n" + author_names.size.to_s + " in total"
end
Of course, I did not get the above in one go. I worked it out step by step in TestWise, using its wonderful “debugging mode” (attaching test execution to the existing browser, so there is no need to restart from the beginning to try out a new test step, which saves a lot of time and keeps the momentum). It did not take long, maybe 15 minutes including analysis time, to get it done.
The output:
Antoine Craske
Ricardo Bedin
Alan Richardson
Daniel Lehner
Maciej Rojek
Martin Ivison
Ioan Solderea
John Ferguson Smart
Elizabeth Zagroba
Jeff Cechinel
Paul de Witt
Criss Chan
Zhimin Zhan
Lutfi Fitroh Hadi
Dan Neciu
Debojyoti Chatterjee
Zhimin Zhan
Malith Senadheera
Nikola Dimic
Mike Harris
Jennifer Columbe
John Miller
Issue #153 has 22 authors, which looks about right. Please note that 100% accuracy is not what I am aiming for, as I am only interested in the top authors.
Counting
In issue 153, my father’s name, “Zhimin Zhan”, appeared twice. We need to count authors by their number of occurrences. This is very easy to do in Ruby!
puts author_names.tally
The output:
{"Antoine Craske"=>1, "Ricardo Bedin"=>1, "Alan Richardson"=>1,
"Daniel Lehner"=>1, "Maciej Rojek"=>1, "Martin Ivison"=>1,
"Ioan Solderea"=>1, "John Ferguson Smart"=>1, "Elizabeth Zagroba"=>1,
"Jeff Cechinel"=>1, "Paul de Witt"=>1, "Criss Chan"=>1,
"Zhimin Zhan"=>2, "Lutfi Fitroh Hadi"=>1, "Dan Neciu"=>1,
"Debojyoti Chatterjee"=>1, "Malith Senadheera"=>1, "Nikola Dimic"=>1,
"Mike Harris"=>1, "Jennifer Columbe"=>1, "John Miller"=>1}
To sort by occurrences:
sorted = author_names.tally.sort_by(&:last)
The output:
[["John Miller", 1],
["Ricardo Bedin", 1],
["Alan Richardson", 1],
["Daniel Lehner", 1],
["Maciej Rojek", 1],
["Martin Ivison", 1],
["Ioan Solderea", 1],
["John Ferguson Smart", 1],
["Elizabeth Zagroba", 1],
["Jeff Cechinel", 1],
["Paul de Witt", 1],
["Criss Chan", 1],
["Antoine Craske", 1],
["Lutfi Fitroh Hadi", 1],
["Dan Neciu", 1],
["Debojyoti Chatterjee", 1],
["Malith Senadheera", 1],
["Nikola Dimic", 1],
["Mike Harris", 1],
["Jennifer Columbe", 1],
["Zhimin Zhan", 2]]
To sort from high to low, reverse with sorted.reverse! to get:
[["Zhimin Zhan", 2],
["Jennifer Columbe", 1],
...
]
To get the top 10:
top_10 = sorted[..9]
Zhimin: it is so easy and intuitive with Ruby, isn’t it?
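The counting steps above (tally, sort, reverse, take the first ten) can be folded into one small helper. This is only a sketch: the name `top_authors` is made up, and it sorts by negative count and then by name so that ties come out in a predictable order, which differs slightly from the `sort_by(&:last)` plus `reverse!` shown above. It needs Ruby 2.7 or later for `tally`.

```ruby
# Hypothetical helper: returns [name, count] pairs, highest count first.
# Ties are broken alphabetically by name.
def top_authors(author_names, limit = 10)
  author_names
    .tally
    .sort_by { |name, count| [-count, name] }
    .first(limit)
end

# Made-up sample input, for illustration only.
names = ["Zhimin Zhan", "Mike Harris", "Zhimin Zhan", "Dan Neciu"]
p top_authors(names, 2)
# prints [["Zhimin Zhan", 2], ["Dan Neciu", 1]]
```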
Put it all together: analyse all 98 issues over the past 2 years
My father started blogging on January 27, 2021. The issue published around that time is #56, so I added a loop to analyse these 98 issues (#56 to #153).
author_names = []
(56..153).each do |issue_no|
  puts "Issue: #{issue_no}"
  driver.get("https://softwaretestingweekly.com/issues/#{issue_no}")
  # ... see above to extract one
  # ...
  author_names << the_link_text
  sleep 1 # don't hit the server too hard
end
Note: I added a sleep of 1 second in between loading each issue to prevent spamming the server too much.
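To show how the per-issue extraction plugs into the loop, here is a minimal sketch that can run without a browser. The helper `collect_author_names` and the `pause` parameter are made up for illustration; the block stands in for the Selenium extraction shown earlier and is assumed to return an array of author names for one issue.

```ruby
# Hypothetical helper: calls the block once per issue number and merges
# the returned name arrays into one flat list.
def collect_author_names(issue_numbers, pause: 1)
  issue_numbers.flat_map do |issue_no|
    names = yield(issue_no)           # per-issue extraction goes here
    sleep pause if pause.positive?    # be gentle with the server
    names
  end
end

# Stubbed data instead of a live browser, purely for illustration.
fake_issues = { 1 => ["Author A", "Author B"], 2 => ["Author A"] }
all_names = collect_author_names(fake_issues.keys, pause: 0) do |issue_no|
  fake_issues.fetch(issue_no)
end
p all_names.tally
# prints {"Author A"=>2, "Author B"=>1}
```

In the real run, the block would load the issue page with `driver.get` and return the filtered link texts, and `issue_numbers` would be `56..153`.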
The result:
[
["Dennis Martinez", 40],
["Zhimin Zhan", 37],
["Antoine Craske", 37],
["Maaret Pyhäjärvi", 30],
["Gleb Bahmutov", 28],
["Pramod Dutta", 24],
["Michael Bolton", 18],
["Mike Harris", 18],
["Callum Akehurst-Ryan", 16],
["Gil Zilberfeld", 16]
]
So, my father ranks second (tied with Antoine Craske on 37), not first.
I quickly checked a few issues and found that a high percentage of the articles by “Dennis Martinez” and “Antoine Craske” are under the “News” category, which probably does not fit the scope of the claim. If I excluded that category, with stw_categories = %w(cc-automation cc-toools cc-tools cc-books cc-videos), the result would be:
[
["Zhimin Zhan", 32],
["Gleb Bahmutov", 28],
["Dennis Martinez", 27],
["Pramod Dutta", 24],
["Gil Zilberfeld", 14],
["Oleksandr Romanov", 12],
["Filip Hric", 12],
["Paul Grizzaffi", 11],
["Marie Drake", 11],
["NaveenKumar Namachivayam", 11]
]
On this measure, my father is at the top. Still, to stay neutral, I will go with the first result (including News, where my father ranked No. 2) for Software Testing Weekly.
Charting
My father’s featured count in Coding Jag
I also tried to create an automation script to do the same for another leading software testing newsletter: Coding Jag. However, author names are not shown in Coding Jag.
So, extracting all authors for comparison is not possible, but I can count the total number of my father’s articles featured there, from Issue 22 to 125 (the same period).
The script below is the main logic for counting unique articles containing zhiminzhan (from his blog URL: https://zhiminzhan.medium.com) in the article links. The full script is listed in a later section.
links = driver.find_elements(:tag_name, "a")
link_texts = links.collect { |x| x["href"] }
zhimin_links = link_texts.compact.select { |y| y.include?("zhiminzhan") }.uniq
zhimin_total_count += zhimin_links.count
The results:
Total number of articles by Zhimin Zhan on Coding Jag: 60
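The snippet above removes duplicate links within each issue and then adds up the per-issue counts. As a sketch of a slightly stricter variation (an assumption on my part, not what the script above does), the links could be de-duplicated across all issues with a Set, in case the same article is linked in more than one issue. The helper name and the sample URLs are made up for illustration.

```ruby
require "set"

# Hypothetical helper: hrefs_per_issue is an array of arrays of href strings.
# nil entries are allowed, since anchors without an href return nil.
def count_unique_links(hrefs_per_issue, keyword)
  seen = Set.new
  hrefs_per_issue.each do |hrefs|
    hrefs.compact.each { |href| seen << href if href.include?(keyword) }
  end
  seen.size
end

# Made-up sample hrefs for two issues, for illustration only.
issues = [
  ["https://zhiminzhan.medium.com/article-a", nil, "https://example.com/other"],
  ["https://zhiminzhan.medium.com/article-a", "https://zhiminzhan.medium.com/article-b"]
]
puts count_unique_links(issues, "zhiminzhan")
# prints 2
```

With the per-issue approach the same sample data would give 3, because article-a appears in both issues.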
Summary
Coding Jag featured my father’s articles more often than Software Testing Weekly: 60 versus 37, about 62% more for the same period.
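The percentage comes from the two totals reported earlier: 37 for Software Testing Weekly (including News) and 60 for Coding Jag. A quick check in Ruby, with variable names chosen just for this sketch:

```ruby
stw_count = 37         # Software Testing Weekly, including the News category
coding_jag_count = 60  # Coding Jag

# Relative increase of Coding Jag over Software Testing Weekly, in percent.
percent_more = ((coding_jag_count - stw_count) / stw_count.to_f * 100).round
puts "#{percent_more}% more"
# prints 62% more
```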
With all the info above, my father’s claim is mostly correct.
Full Test Script