Creative Web Automation: Generate Ad-Free Study Notes
How to use an automated script to grab only the text content off online study notes
This is included in the “How to in Selenium WebDriver” series.
For my high-school English book studies, I occasionally referred to online study notes, such as SparkNotes. SparkNotes (and others) provided the content free but split it into multiple small sections with commercial ads. I could live with it; just a lot of pages to click through. My father saw it and created clean combined versions of just the text on these sites using automation scripts.
This article documents how I repeated his work, using raw Selenium in Ruby scripts.
Table of Contents:
· Target Website
∘ Main Page (for Macbeth)
∘ A Section Page:
· Automation Design
· Implementation
∘ 1. Save note pages
∘ 2. Parse the HTML to extract the content
∘ 3. Filter out unneeded links.
∘ 4. Combine all section files.
∘ 5. Optimize and Print out
· Complete Scripts
∘ Script 1 — Save the notes (multiple) to separate HTML files
∘ Script 2 — Parse and generate the clean HTML version
· Review (by Zhimin)
∘ Tip 1: Flexible to use the best tool for the job
∘ Tip 2: Automation Script ≠ Automated Test Script
∘ Tip 3: Automation can help you more than you think
Target Website
Main Page (for Macbeth)
I’ll use the play Macbeth on SparkNotes as an example.
Each “Act ?, scenes ???” link goes to a separate summary and analysis page.
A Section Page:
The site has quite a number of ads and distractions (and even popups). My purpose is to create a clean version of just the text.
Automation Design
1. Navigate to the main Study Notes page, and extract the links starting with “Act” into an array.
2. Define a pure_content_html string to hold the content.
3. Iterate over the link array, and navigate to each page.
4. Extract the notes part from the HTML source, and append it to pure_content_html.
5. Generate a new HTML file with the extracted pure_content_html.
6. Generate a PDF document for printing (this step can be done manually).
Implementation
URL: https://www.sparknotes.com/shakespeare/macbeth/
1. Save note pages
First, I used Selenium WebDriver to navigate to the book’s main page and collect the links to all of the sections.
links = driver.find_elements(:xpath, "//div[@data-jump='summary']//li/a[@class='landing-page__umbrella__link' and starts-with(text(), 'Act')]")
section_links = links.collect { |x| x["href"] }
Then, for each section’s link, visit that page and download the HTML:
the_html = driver.page_source
puts the_html.size # verify HTML was obtained
File.write("tmp.html", the_html) # File.write also closes the file
There will be several files created: section-0.html, section-1.html, …, section-7.html.
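The snippet above writes to a single tmp.html; to end up with one file per section, the file name needs the loop index. Here is a minimal sketch, assuming a hypothetical helper save_sections and a fetch_page callable that stands in for driver.get plus driver.page_source (neither name is from the original script). It also skips files that already exist, so a rerun after a failure does not start again from the first section.

```ruby
require "fileutils"

# Hypothetical helper (not from the original script): saves each section page
# to <dir>/section-<idx>.html and returns the list of file paths.
# fetch_page is any callable that takes a URL and returns that page's HTML,
# e.g. ->(url) { driver.get(url); driver.page_source } with Selenium.
def save_sections(section_links, dir, fetch_page)
  FileUtils.mkdir_p(dir)
  section_links.each_with_index.map do |url, idx|
    path = File.join(dir, "section-#{idx}.html")
    # Skip pages saved by an earlier run, so a rerun resumes where it stopped.
    File.write(path, fetch_page.call(url)) unless File.exist?(path)
    path
  end
end
```

With Selenium, this could be called as save_sections(section_links, "/tmp", ->(url) { driver.get(url); driver.page_source }).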
2. Parse the HTML to extract the content
I had some issues getting this to run as part of the Selenium script. It could be done, but not efficiently. Instead, I used a plain Ruby file to open the downloaded HTML files and parse the content with Nokogiri. This is also more convenient, as I do not have to open the browser and download the pages each time.
I will focus on the first file, section-0.html (all eight files have the same structure, i.e., they are the same from our parsing point of view).
# use Nokogiri to parse HTML
require 'nokogiri'

the_html = File.read("/tmp/section-0.html")
doc = Nokogiri::HTML(the_html)
The main content we are interested in (Act and Scene numbers, and the summary) is under the div with the class mainTextContent, in the <h3> and <p> tags. So we will use Nokogiri to extract only these elements.
the_pure_content_html = "<html><body>"
elem_main_content = doc.xpath("//div[contains(@class, 'mainTextContent')]")

# only keep h3 and p tagged elements; x.text is the element's text content
elem_main_content.children.each do |x|
  if x.name == "h3"
    the_pure_content_html += ("\n<br/><h3>" + x.text + "</h3>\n")
  elsif x.name == "p"
    the_pure_content_html += ("\n<p>" + x.text + "</p>\n")
  end
end

the_pure_content_html += "\n</body></html>"
3. Filter out unneeded links.
There are some additional <p> tags that we don’t want to keep in the clean version, such as paragraphs that contain only links. We can filter these out.
# revised filter (inside the loop over the children)
if x.name == "h3"
  the_pure_content_html += ("\n<br/><h3>" + x.text + "</h3>\n")
elsif x.name == "p"
  if x.to_s.include?("<p><a href") || x.to_s.include?("<p><span")
    # skip additional links
  else
    the_pure_content_html += ("\n<p>" + x.text + "</p>\n")
  end
end
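The string check above can be pulled into a small predicate, which makes it easy to try against sample markup without a browser or the saved pages. This is only a sketch: skip_paragraph? is a hypothetical name, the sample strings are invented for illustration (not copied from the site), and it uses start_with? rather than include?, on the assumption that the unwanted markup always sits at the start of the paragraph.

```ruby
# Hypothetical helper: true when a paragraph's HTML starts with a link or a
# <span>, i.e. the navigation-style paragraphs we want to leave out.
def skip_paragraph?(paragraph_html)
  paragraph_html.lstrip.start_with?("<p><a href", "<p><span")
end

# Invented sample markup, for illustration only.
samples = [
  "<p>A summary paragraph that should be kept.</p>",
  '<p><a href="/some/other/page">A link-only paragraph</a></p>',
  "<p><span>A span-only paragraph</span></p>"
]

kept = samples.reject { |html| skip_paragraph?(html) }
puts kept
```

Inside the loop, the condition would then read: if skip_paragraph?(x.to_s).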
4. Combine all section files.
After confirming the first section is done properly, we can simply put it into a loop to process all the section files together.
the_pure_content_html = "<html><body>"
8.times do |idx|
  doc = Nokogiri::HTML(File.read("/tmp/section-#{idx}.html"))
  # ...
  # append the filtered <h3> and <p> content to the_pure_content_html (see above)
end
the_pure_content_html += "\n</body></html>"
the_pure_content_html now holds only the clean content (section headings and summaries). Save it into an HTML file.
File.write("/tmp/clean-ver.html", the_pure_content_html)
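The loop above hard-codes eight sections. As a sketch of an alternative, assuming the section-N.html naming used so far, the script could collect whatever section files exist and sort them by their number, so a book with a different number of sections needs no change. section_files is a hypothetical helper name.

```ruby
# Hypothetical helper: returns the saved section files in numeric order.
# A plain string sort would put section-10.html before section-2.html.
def section_files(dir)
  Dir.glob(File.join(dir, "section-*.html")).sort_by do |path|
    File.basename(path)[/\d+/].to_i
  end
end
```

The loop would then become section_files("/tmp").each do |path| ... end, reading each file with File.read(path).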
5. Optimize and Print out
Just open /tmp/clean-ver.html in a browser, and print it to PDF.
It’s clean — no distracting advertisements or popups! We can improve further by adding inline styling.
the_pure_content_html = "<html><head>
<style>
body {background-color: #FFF;
font-family: Verdana, Helvetica, Arial;
font-size: 14px; }
h3 {font-size: 15px; color: blue;}
</style>
</head><body>\n"
# ...
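One way to keep the styling separate from the extraction loop is a small wrapper that takes the extracted body and returns the full page. A minimal sketch, with the same CSS as above; wrap_in_page is a hypothetical name, not part of the original script.

```ruby
# Hypothetical helper: wraps already-extracted body HTML in a styled page.
def wrap_in_page(body_html)
  <<~HTML
    <html><head>
    <style>
      body { background-color: #FFF;
             font-family: Verdana, Helvetica, Arial;
             font-size: 14px; }
      h3 { font-size: 15px; color: blue; }
    </style>
    </head><body>
    #{body_html}
    </body></html>
  HTML
end
```

The extraction loop can then build only the <h3> and <p> content, and the final step becomes File.write("/tmp/clean-ver.html", wrap_in_page(body)).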
Complete Scripts
Script 1 — Save the notes (multiple) to separate HTML files
Using Raw Selenium WebDriver
it "Download Macbeth Sparknotes" do
  driver.get("https://www.sparknotes.com/shakespeare/macbeth")

  # Main page: get all section links
  links = driver.find_elements(:xpath, "//div[@data-jump='summary']//li/a[@class='landing-page__umbrella__link' and starts-with(text(), 'Act')]")
  section_links = links.collect { |x| x["href"] }

  # Visit each section page and save its HTML to its own file
  section_links.each_with_index do |current_section, idx|
    driver.get(current_section)
    the_html = driver.page_source # get page HTML
    File.write("/tmp/section-#{idx}.html", the_html)
  end
end
Script 2 — Parse and generate the clean HTML version
Using the Nokogiri Ruby gem to parse the HTML.
require 'nokogiri'

the_pure_content_html = "<html><head>
<style>
body {background-color: #FFF;
font-family: Verdana, Helvetica, Arial;
font-size: 14px;}
h3 {font-size: 15px; color: blue;}
</style>
</head><body>\n"

8.times do |idx|
  the_html = File.read("/tmp/section-#{idx}.html")
  doc = Nokogiri::HTML(the_html)
  elem_main_content = doc.xpath("//div[contains(@class, 'mainTextContent')]")

  # only keep h3 and p tagged elements; x.text is the element's text content
  elem_main_content.children.each do |x|
    if x.name == "h3"
      the_pure_content_html += ("\n<br/><h3>" + x.text + "</h3>\n")
    elsif x.name == "p"
      if x.to_s.include?("<p><a href") || x.to_s.include?("<p><span")
        # skip additional links
      else
        the_pure_content_html += ("\n<p>" + x.text + "</p>\n")
      end
    end
  end
end

the_pure_content_html += "\n</body></html>"
File.write("/tmp/clean-ver.html", the_pure_content_html)
Review (by Zhimin)
First of all, the purpose of this automation is not commercial; it is a fun automation exercise. If you like and use the content of a website, pay for the premium privileges. (By the way, the page structure changes often, so the script will become obsolete quickly.)
Tip 1: Flexible to use the best tool for the job
Courtney used Selenium only in Script 1, just for saving all section pages in separate files. A common question that automated testers might ask: “Why not use Selenium to extract the heading and notes in each section page?”
Yes, it could be done. But in this case, that’s not the best method, as it would be slow and less reliable (if it failed on one section, the whole run would need to restart from the beginning).
Ruby’s Nokogiri, IMO, is the best XML/HTML parsing library (several Java programmers told me so after using it).
Tip 2: Automation Script ≠ Automated Test Script
This is not an automated test script, just an automation script. Automated test scripts, in my opinion, require a lot more effort. For example, I would NOT refactor this script to conform to Maintainable Automated Test Design, as it is often unnecessary.
The reliability of automation scripts also does not need to be as high as that of automated test scripts. If an automation script fails, it is quite commonly OK to run it one or two more times. However, automated tests in real agile projects are run in a CI/CD (or CT) process, where one failed test means the whole build fails.
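For an automation script, “run it one or two more times” can also be written into the script itself. A minimal sketch, assuming a hypothetical with_retries helper (not part of Selenium or of the scripts above) that reruns a block a limited number of times before giving up:

```ruby
# Hypothetical helper: runs the block, retrying when it raises StandardError,
# up to `attempts` runs in total; re-raises the last error after that.
# The block receives the current attempt number (1, 2, ...).
def with_retries(attempts: 3, wait: 0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts
    sleep(wait) # pause before the next attempt
    retry
  end
end
```

In Script 1, a page load could then be wrapped as with_retries(attempts: 3, wait: 2) { driver.get(current_section) }.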
Tip 3: Automation can help you more than you think
Automation has wide use in software development, not just QA. I have used automation for various tasks (development, building, test data preparation, and of course, testing) in my career. Check out this article for some example uses.
Sadly, many software engineers don’t realize that. The main reason is that these skills are not taught at universities.
If you are a software engineer (or SET), assess the tasks you perform every day, and you might be surprised how much you can improve with automation. Even lacking automation scripting skills, you can start with the thought, and then learn it. It will make your work much more fun because you will feel creative and productive.