Create an Automated Site Spell Checker

To create our spell checker we’re going to need a script that performs the following tasks.

1. Get a list of all the URLs used on the site
2. View the content of the pages (without all those HTML tags)
3. Check the spelling

To get a list of the URLs we’re going to use a program called linklint. This Perl program checks the links of a site and, while doing so, creates a file listing all the URLs for the site. After you’ve installed the program, run a command similar to the one below, replacing ‘www.example.com’ with your own site and ‘example_results’ with the directory you want the results saved to.


linklint -http -host www.example.com -limit 1000 -doc example_results /@

This will check every link on your site (limited to 1000) and generate several files with the results.  The important file is httpok.txt, which contains a list of all the internal URLs for the site (or at least all the ones that weren’t broken).
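As a rough sketch of how that list might be read in Ruby, the helper below pulls site-relative paths out of the text of httpok.txt. It assumes, as the full script later in this post does, that each URL sits on its own indented line beginning with a /, and it skips anything containing a period so that images and stylesheets are left out. The name extract_paths and the sample text are made up for illustration; check your own httpok.txt before relying on the pattern.

```ruby
# Pull site-relative paths out of the text of linklint's httpok.txt.
# Assumption: each URL is on its own indented line that starts with "/".
# Lines containing a period (e.g. /style.css) are skipped.
def extract_paths(listing)
  listing.each_line
         .select { |line| line =~ /\A\s+\// && !line.include?('.') }
         .map(&:strip)
         .uniq
end

# Hypothetical sample; the real file layout may differ.
sample = <<~TEXT
  # some header line
      /
      /about
      /style.css
      /about
TEXT

puts extract_paths(sample).inspect
```

Keeping this as a function that takes a string, rather than a file name, makes it easy to try against a pasted sample first.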

So that’s step 1 complete. We could certainly use the list of URLs and wget to grab the source for each page, but then we would need an HTML parser to strip out the markup.  An easier way is to use the text web browser Lynx.  You can view a web page inside the shell using something like:


lynx http://www.example.com

To save the page rather than browse it, use the -dump parameter, e.g.


lynx -dump http://www.example.com > example_index.txt

This outputs the page as plain text.  It is now a relatively simple task to combine the results from linklint with Lynx to generate a text file for every page of your site without any HTML.
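One way that combination could look is sketched below: a small helper that builds the lynx -dump command for a single path. The helper name lynx_dump_command, the numbered file names and the lynx_results directory are illustrative choices, not anything linklint or Lynx require. Shellwords.escape is used so that unusual characters in a URL are not interpreted by the shell.

```ruby
require 'shellwords'

# Build the shell command that dumps one page to a numbered text file.
# host, out_dir and the numbering scheme are illustrative assumptions.
def lynx_dump_command(host, path, out_dir, index)
  url = "http://#{host}#{path}"
  output_path = File.join(out_dir, "#{index}.txt")
  "lynx -dump #{Shellwords.escape(url)} > #{Shellwords.escape(output_path)}"
end

# Print the commands instead of running them, so they can be checked first.
paths = ['/', '/about']
paths.each_with_index do |path, i|
  puts lynx_dump_command('www.example.com', path, 'lynx_results', i)
end
```

Once the printed commands look right, each string could be passed to system to run it.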

The final step is to check the spelling of each of the files.  For that you can just use something like spell.


spell example_index.txt
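spell prints the words it does not recognise, one per line, so the output usually needs a little tidying before it is useful. The sketch below de-duplicates that output and drops words you already know are fine. The function name filter_misspellings and the sample output are invented for illustration, and ignoring case when comparing against the safe list is a design choice you may not want.

```ruby
# Turn raw spell output (one flagged word per line) into a de-duplicated
# list, dropping words known to be fine. Comparison ignores case.
def filter_misspellings(spell_output, safe_words)
  safe = safe_words.map(&:downcase)
  spell_output.split("\n")
              .map(&:strip)
              .reject(&:empty?)
              .uniq
              .reject { |word| safe.include?(word.downcase) }
end

# Hypothetical output, standing in for the result of running spell.
raw = "www\nexmaple\ncom\nexmaple\nLynx\n"
puts filter_misspellings(raw, %w[http www com lynx]).inspect
```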

Those are the basic steps needed to create a site spell checker. You might want to use those three programs to write a script in the programming language of your choice.  Alternatively, continue with the tutorial to see how to create a Ruby script to perform the spell checking.

Below is a very simple version written in Ruby for you to expand on.


require 'fileutils'

# Get the files to use in the spelling checker
site_url = 'www.example.com'
results = File.join(ENV["HOME"],'example_results')

linklint_results = File.join(results,'linklint_results')
lynx_results = File.join(results,'lynx_results')

# Create a directory for the results if they don't exist
FileUtils.mkdir_p(lynx_results)
FileUtils.mkdir_p(linklint_results)

# Run linklint for the site
linklint_command = "linklint -http -host #{site_url} -limit 1000 -doc #{linklint_results} /@"
puts linklint_command
system linklint_command

# Create an array of all the URLs on the site
site_urls = []
File.foreach(File.join(linklint_results,'httpok.txt')) do |line|

  # Find all the lines that start with whitespace then a / and don't include a period (.)
  if line =~ /^\s+\/.*/ && !(line =~ /\./)
    site_urls << line.gsub(/\s/,'')
  end
end

# Save each URL as a plain-text file
count = 0
site_urls.each do |x|
  output_path = File.join(lynx_results,count.to_s+'.txt')
  system "lynx -dump http://#{site_url}#{x} > #{output_path}"
  count += 1
end

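# Words that spell is likely to flag but that are fine for this site
safe_words = %w{http localhost com www co uk}

# Spell check each saved page (-b asks spell for British spelling)
count.times do |x|
  output_path = File.join(lynx_results,x.to_s+'.txt')
  spell_dump = `spell -b #{output_path}`
  error_words = spell_dump.split("\n").uniq
  error_words = (error_words - safe_words)
  puts error_words.inspect
end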

I haven’t checked it on Windows yet but I’ll try and do that soon and make any corrections.

You don’t need to use it just for spell checking either.  Companies are generally quite particular about the case and spacing of their name, and a quick search through the results will let you check that a company’s name is written in the correct case, etc.
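A minimal sketch of that kind of check is below. It finds every occurrence of a name regardless of case and reports the spellings that differ from the preferred form. The name ExampleCo, the function name_variants and the sample text are all made up, and this version only looks at case, not spacing.

```ruby
# Find every spelling of a name in the text, ignoring case, and return
# the variants that do not match the preferred form exactly.
def name_variants(text, preferred)
  text.scan(/\b#{Regexp.escape(preferred)}\b/i).uniq - [preferred]
end

# Hypothetical page text, standing in for one of the lynx dump files.
page = "Welcome to ExampleCo. exampleco makes widgets; Exampleco ships them."
puts name_variants(page, 'ExampleCo').inspect
```

Running this over each of the text files produced by Lynx would list the pages where the name needs correcting.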