One of the many pleasures of being a developer is that we're able to solve problems, no matter how obscure, through code. We can automate day-to-day tasks that might drive others mad. Our solutions range from something as small as a bash script to something as elaborate as an entire application. I think nearly every developer is guilty of automation. One common task in this category of coding is screen scraping: the act of obtaining data from the web by parsing raw data, typically markup. Follow along as we build the skeleton for a simple screen scraping CLI in Go capable of notifying you whenever new content appears on your favorite web comics.

Keeping a Record

In order to be notified when new content appears, we have to know what content is actually new. There are a number of ways this could be accomplished, but since we're dealing with a stateless CLI it makes the most sense to record our findings in a database. Once again I'll turn to gorp for managing the Record model. The two notable fields on the model are Source, which records the source of the web comic content (typically a page with chapter listings), and Latest, which records the most recent known chapter for the given web comic source. Both fields are stored in the form of a URL. Each source of content is unique and will have at most one entry in the records table.

import (
	"github.com/coopernurse/gorp"
	"time"
)

type Record struct {
	Id       int64     `db:"id"`
	Source   string    `db:"source"`
	Latest   string    `db:"latest"`
	Modified time.Time `db:"modified"`
}

// PreInsert is a gorp hook that stamps the record before insertion.
func (p *Record) PreInsert(s gorp.SqlExecutor) error {
	p.Modified = time.Now().UTC()
	return nil
}

// PreUpdate is a gorp hook that stamps the record before each update.
func (p *Record) PreUpdate(s gorp.SqlExecutor) error {
	p.Modified = time.Now().UTC()
	return nil
}
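
The database snippets later in the post rely on a dbmap variable that's never shown. Here's a minimal sketch of how it might be wired up with gorp, assuming SQLite via the mattn/go-sqlite3 driver and a hypothetical comics.db file (both assumptions, not part of the original post):

import (
	"database/sql"

	"github.com/coopernurse/gorp"
	_ "github.com/mattn/go-sqlite3"
)

var dbmap *gorp.DbMap

func initDb() error {
	db, err := sql.Open("sqlite3", "comics.db")
	if err != nil {
		return err
	}
	dbmap = &gorp.DbMap{Db: db, Dialect: gorp.SqliteDialect{}}

	// Register the Record model; gorp calls the PreInsert/PreUpdate hooks above.
	t := dbmap.AddTableWithName(Record{}, "records").SetKeys(true, "Id")

	// Enforce at most one record per source.
	t.ColMap("Source").SetUnique(true)

	return dbmap.CreateTablesIfNotExists()
}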

A Screen Scraping Task

The way that content is examined will likely differ from source to source, so it's important to have a flexible interface that allows for this variance. However, the findings of any given task need to be presented in a standard form that's compatible with the Record type. Using a simple task interface like the following, and a standard TaskResult struct type, we can develop more specialized tasks that deal with each source.

// TaskResult represents the results from a Task.
type TaskResult struct {
	Source *url.URL
	Result *url.URL
	Error  error
}

// Task defines the interface for all screen scrape tasks.
type Task interface {
	Run(source *url.URL)
	Result() *TaskResult
}

Continuing on with the web comic theme, here's an example task that implements the Task interface and is capable of finding new chapters from comics that appear on the fictional "Acme Web Comics" site:

// AcmeWebComics retrieves the most recent comic from Acme Web Comics.
type AcmeWebComics struct {
	source   *url.URL
	result   *url.URL
	runError error
}

func (t *AcmeWebComics) Run(source *url.URL) {
	t.source = source

	doc, err := goquery.NewDocument(source.String())
	if err != nil {
		t.runError = err
		return
	}

	latest, exists := doc.Find(".chapter-list ul li a").First().Attr("href")
	if exists {
		latestUrl, err := url.ParseRequestURI(latest)
		if err != nil {
			t.runError = err
			return
		}
		t.result = latestUrl
	} else {
		t.runError = errors.New("unable to find element matching selector")
	}
}

func (t *AcmeWebComics) Result() *TaskResult {
	return &TaskResult{
		Source: t.source,
		Result: t.result,
		Error:  t.runError,
	}
}

The Run method of AcmeWebComics uses the wonderful jQuery-inspired library goquery. It works by requesting the content of the provided source and looking for an element matching the given CSS selector (.chapter-list ul li a). Should the task find its target element, it updates the task's result field with the URL found in the href attribute. If it's unable to find what it's after or runs into an error along the way, the task populates the runError field with the encountered error.
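
For contrast, the FoobarComics task referenced later in the switch statement isn't shown in the post. Here's a minimal sketch that follows the same pattern, with an assumed selector (.latest-chapter a) for the equally fictional Foobar Comics site:

// FoobarComics retrieves the most recent comic from Foobar Comics.
// (A sketch following the AcmeWebComics pattern; the selector is assumed.)
type FoobarComics struct {
	source   *url.URL
	result   *url.URL
	runError error
}

func (t *FoobarComics) Run(source *url.URL) {
	t.source = source

	doc, err := goquery.NewDocument(source.String())
	if err != nil {
		t.runError = err
		return
	}

	// Hypothetical selector for the fictional Foobar Comics chapter list.
	latest, exists := doc.Find(".latest-chapter a").First().Attr("href")
	if !exists {
		t.runError = errors.New("unable to find element matching selector")
		return
	}

	latestUrl, err := url.ParseRequestURI(latest)
	if err != nil {
		t.runError = err
		return
	}
	t.result = latestUrl
}

func (t *FoobarComics) Result() *TaskResult {
	return &TaskResult{
		Source: t.source,
		Result: t.result,
		Error:  t.runError,
	}
}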

Our Very Own Flag

Now that the data and basic task implementation are sorted out, we can start walking through the CLI from its entry point. The basic usage of the CLI looks like this:

go-screenscrape-template \
  -url http://www.webcomic.com/comic1/ \
  -url http://www.webcomic.com/comic2/

The CLI can accept an unlimited number of URL parameters. Though there isn't a built-in type in the flag package for handling multi-value flags, it's pretty simple to create your own custom flag type. Here's the type UrlFlag, which represents a slice of *url.URL:

type UrlFlag []*url.URL

func (flag *UrlFlag) String() string {
	return fmt.Sprintf("%v", *flag)
}

func (flag *UrlFlag) Set(val string) error {
	urlVal, err := url.ParseRequestURI(val)
	if err != nil {
		return err
	}

	if !urlVal.IsAbs() {
		return fmt.Errorf("invalid URL '%s': all values must be absolute", urlVal)
	}

	*flag = append(*flag, urlVal)
	return nil
}

func (flag *UrlFlag) Get() interface{} {
	return []*url.URL(*flag)
}

The UrlFlag type satisfies the flag.Getter and flag.Value interfaces. Using the custom type is pretty straightforward:

var urlFlag UrlFlag

func init() {
	// Initialize urlFlag variable, and define url flag
	urlFlag = make(UrlFlag, 0)
	flag.Var(&urlFlag, "url", "Absolute URL to known comic source.")
}

func main() {
	// Parse command line flags and call screenscrape.Run() if one or more URLs were provided
	flag.Parse()
	if len(urlFlag) > 0 {
		screenscrape.Run(urlFlag...)
	}
}
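
If you'd like to sanity-check the flag behavior, a quick test sketch might look like this (the test name and expectations are my own assumptions, not from the original project):

import "testing"

func TestUrlFlagSet(t *testing.T) {
	var f UrlFlag

	// Absolute URLs should be accepted and appended.
	if err := f.Set("http://www.acmewebcomics.com/comic1/"); err != nil {
		t.Fatalf("expected absolute URL to be accepted, got: %v", err)
	}

	// Relative URLs parse, but must be rejected by the IsAbs check.
	if err := f.Set("/relative/path"); err == nil {
		t.Fatal("expected relative URL to be rejected")
	}

	if len(f) != 1 {
		t.Fatalf("expected 1 URL, got %d", len(f))
	}
}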

Running the Show

The screenscrape package defines a function with the signature Run(sources ...*url.URL) which handles the basic execution logic of the CLI. We'll step through the contents of this function and examine its functionality one piece at a time.
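
Before diving in, here's the rough shape of the function for orientation (a sketch of the structure, not the verbatim body):

// Run coordinates the whole scrape; each numbered step is expanded below.
func Run(sources ...*url.URL) {
	// 1. Map each source URL to a Task by host, then run all tasks concurrently.
	// 2. Collect the results of tasks that completed without error.
	// 3. Fetch existing records for those sources from the database.
	// 4. Update changed records and insert records for brand new sources.
	// 5. If any new chapters were found, send an email notification.
}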

Things begin with a simple switch statement on the host of each URL. If a task type exists for that host, it gets added to the tasks slice; an unknown host causes the application to panic. Next, a separate goroutine is started for each task, and a sync.WaitGroup provides the necessary functionality to wait for all of the tasks to complete.

var wg sync.WaitGroup
var tasks = make([]Task, len(sources))

// Get task by host for each URL
for i, taskUrl := range sources {
	switch taskUrl.Host {
	case "www.acmewebcomics.com":
		tasks[i] = new(AcmeWebComics)
	case "www.foobarcomics.com":
		tasks[i] = new(FoobarComics)
	default:
		log.Panicf("Unknown host %v", taskUrl.Host)
	}
}

wg.Add(len(tasks))

// run each task
for i, task := range tasks {
	go func(t Task, u *url.URL) {
		t.Run(u)
		wg.Done()
	}(task, sources[i])
}

// Wait for all tasks to complete
wg.Wait()

Next we iterate over each completed task and, so long as the result doesn't include an error, add it to the results map for easy lookup later. The resultSources slice stores the source URLs as strings for the SQL lookup shown next.

// Collect successful tasks
results := map[string]*TaskResult{}
resultSources := []string{}

for _, task := range tasks {
	r := task.Result()
	if r.Error == nil {
		results[r.Source.String()] = r
		resultSources = append(resultSources, r.Source.String())
	} else {
		log.Warnf("Task for '%v' encountered err '%v'", r.Source, r.Error)
	}
}

By fetching records for each successful source URL from the database, we can compare their latest chapter with the latest chapter retrieved by the task.

var records []*Record
if _, err := dbmap.Select(&records,
	fmt.Sprintf(`SELECT r.* FROM records r WHERE r.source IN ('%s') ORDER BY r.source ASC`, strings.Join(resultSources, "', '"))); err != nil {
	log.WithFields(log.Fields{
		"error": err,
	}).Error("Error querying database")
	return
}
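
Since the source URLs come from our own command line, splicing them into the query with fmt.Sprintf is tolerable here, but a parameterized version is safer if the inputs are ever untrusted. A sketch, assuming a driver that accepts ? placeholders:

// Build a parameterized IN clause instead of splicing strings into the SQL.
placeholders := make([]string, len(resultSources))
args := make([]interface{}, len(resultSources))
for i, s := range resultSources {
	placeholders[i] = "?"
	args[i] = s
}

query := fmt.Sprintf(
	`SELECT r.* FROM records r WHERE r.source IN (%s) ORDER BY r.source ASC`,
	strings.Join(placeholders, ", "))

var records []*Record
if _, err := dbmap.Select(&records, query, args...); err != nil {
	log.WithFields(log.Fields{
		"error": err,
	}).Error("Error querying database")
	return
}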

First we compare existing records from the database against the results map. If the results map shows a new chapter for a task, the chapter gets added to newLatest to be sent in the email notification later and the database record is updated. Either way, the record's key is removed from the results map so that only brand new sources remain.

// Keep track of new chapters to be sent in the email
newLatest := []string{}

// First check existing records
for _, record := range records {
	if result, ok := results[record.Source]; ok {
		if result.Result.String() != record.Latest {
			// new latest URL, save URL and update record
			log.Infof("Found new chapter '%v'", result.Result)
			newLatest = append(newLatest, result.Result.String())
			record.Latest = result.Result.String()
			if _, err := dbmap.Update(record); err != nil {
				log.WithFields(log.Fields{
					"error":     err,
					"record_id": record.Id,
				}).Error("Error updating record")
			}
		}
	}
	delete(results, record.Source)
}

Any keys remaining in results after the previous code should be brand new sources that don't yet have records in the database. Iterate over them, inserting new records and appending the results to newLatest for the email notification.

for source, result := range results {
	log.Infof("Found new chapter '%v'", result.Result)
	newLatest = append(newLatest, result.Result.String())
	if err := dbmap.Insert(&Record{
		Source: source,
		Latest: result.Result.String(),
	}); err != nil {
		log.WithFields(log.Fields{
			"error":  err,
			"source": source,
			"latest": result.Result,
		}).Error("Error inserting record")
	}
}

Finally, we check whether any new chapters were found and, if so, render and send the email notification.

if len(newLatest) > 0 {
	var mailContent bytes.Buffer
	ctx := struct {
		Chapters []string
	}{
		newLatest,
	}

	log.Info("Sending new chapter email")
	if err := newChaptersEmailTemplate.Execute(&mailContent, ctx); err != nil {
		log.WithFields(log.Fields{
			"error": err,
		}).Error("Error executing email template")
	} else if err := SendMail(mailContent.String()); err != nil {
		log.WithFields(log.Fields{
			"error": err,
		}).Error("Error sending email")
	}
}

The SendMail code has been omitted here, but you can find it in the full project source.
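
The newChaptersEmailTemplate value is also omitted. A minimal sketch using text/template, matching the Chapters context above (the message wording is an assumption):

import "text/template"

// newChaptersEmailTemplate renders the plain-text body of the notification.
var newChaptersEmailTemplate = template.Must(template.New("newChapters").Parse(
	"New chapters are available:\n\n{{range .Chapters}}{{.}}\n{{end}}"))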

Automation... Complete!

At this point you should be feeling pretty satisfied: you just automated your web comic addiction! Now you can use those precious minutes spent checking for content updates every day to... read more comics? See the full CLI project skeleton from this post on GitHub and customize it to your heart's content. For more robust CLI interfaces, check out the excellent cli.go package.
