http://urldiff.appspot.com/
Architectural overview
UrlDiffMonitor (UDF) lets users detect changes on web pages. The user submits a URL, and UDF regularly checks whether content on the page has been added or removed. The user is notified of modifications through an Atom feed, email, and Google Talk instant messages. UDF is an application built on top of Google App Engine (Java version).
Blocks:
Web application: it allows users to manage their URLs (add, delete, and list them) and to obtain their Atom feed.
Checker: an automated unit that runs regularly and detects changes on remote web pages (the users' URLs). The Checker is started periodically by a Cron Service trigger. On every run it schedules check activities on the TaskQueue Service.
Feeder: it publishes the user Atom Feed.
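The Checker's selection of URLs to check on each run could be sketched as follows. This is a minimal illustration, not the application's actual code: the class name CheckScheduler, the MonitoredUrl fields, and the check interval are all assumptions, since the overview does not say how an "elapsed" URL is determined. In the real application each returned URL would become one task on the TaskQueue Service.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of how the Checker might pick the URLs to check in one run. */
public class CheckScheduler {

    /** A stored URL; lastCheckedMillis is null when the URL has never been checked. */
    public static final class MonitoredUrl {
        public final String url;
        public final Long lastCheckedMillis;

        public MonitoredUrl(String url, Long lastCheckedMillis) {
            this.url = url;
            this.lastCheckedMillis = lastCheckedMillis;
        }
    }

    /**
     * Returns the URLs that are new (never checked) or "elapsed"
     * (last checked at least intervalMillis ago). The interval is an assumed parameter.
     */
    public static List<String> selectDue(List<MonitoredUrl> urls, long nowMillis, long intervalMillis) {
        List<String> due = new ArrayList<>();
        for (MonitoredUrl u : urls) {
            boolean isNew = u.lastCheckedMillis == null;
            boolean elapsed = !isNew && nowMillis - u.lastCheckedMillis >= intervalMillis;
            if (isNew || elapsed) {
                due.add(u.url); // one check task would be enqueued per URL
            }
        }
        return due;
    }
}
```

Keeping the selection separate from the enqueueing makes the "which URLs are due" rule easy to test without the Cron or TaskQueue services.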
Steps:
User submits a new URL
URL is stored
Cron Service triggers the Checker
Checker determines which URLs need to be checked (new and "elapsed" URLs)
Checker enqueues check activities for the URLs from the previous step (one check task per URL)
TaskQueue service starts the check activities, one at a time.
Checker detects changes:
    executes an HTTP GET
    extracts plain text from the HTML (removes tags, scripts, styles, etc.)
    compares the old text (retrieved from the Datastore) with the new text (from the HTTP GET)
    executes the diff algorithm
If changes are detected, Checker:
    stores information in the Datastore (last-modified date, diff info, new text, etc.)
    sends email and XMPP notifications
A user's feed reader requests the feed (by calling the Feeder URL)
Feeder builds the Atom XML feed using data from the Datastore and Memcache.
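The text-extraction and comparison part of the steps above can be sketched in plain Java. This is an illustration under stated assumptions, not UDF's implementation: the overview does not say which HTML parser or diff algorithm is used, so the sketch substitutes regular expressions for tag removal and a simple line-set comparison for the diff. The names ChangeDetector, toPlainText, and diff are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: strip HTML down to text lines and report added/removed lines. */
public class ChangeDetector {

    /** Lines present only in the new text (added) or only in the old text (removed). */
    public static final class Diff {
        public final List<String> added;
        public final List<String> removed;

        Diff(List<String> added, List<String> removed) {
            this.added = added;
            this.removed = removed;
        }

        public boolean hasChanges() {
            return !added.isEmpty() || !removed.isEmpty();
        }
    }

    /** Removes script/style blocks, comments, and tags; returns non-empty trimmed lines. */
    public static String toPlainText(String html) {
        String text = html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", "\n") // drop scripts and styles
                .replaceAll("(?s)<!--.*?-->", "\n")                       // drop comments
                .replaceAll("<[^>]+>", "\n");                             // every tag becomes a line break
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(trimmed);
            }
        }
        return out.toString();
    }

    /** Set-based line comparison; ignores line order and duplicate lines. */
    public static Diff diff(String oldText, String newText) {
        Set<String> oldLines = lines(oldText);
        Set<String> newLines = lines(newText);
        List<String> added = new ArrayList<>(newLines);
        added.removeAll(oldLines);
        List<String> removed = new ArrayList<>(oldLines);
        removed.removeAll(newLines);
        return new Diff(added, removed);
    }

    private static Set<String> lines(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                result.add(line);
            }
        }
        return result;
    }
}
```

Because every tag is turned into a line break, inline markup such as bold text splits a sentence across lines, and HTML entities are not decoded. A real HTML parser and a sequence-based diff would handle these cases better.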
Notes:
Authentication is based on Google Authentication (integrated in App Engine). Each user needs to have a valid Google account.
Memcache is used to optimize Feeder activity. Feeder builds the XML only if it is not available from Memcache. After building the feed, it stores the feed in Memcache for subsequent requests. The cached copy is invalidated in two cases: after 24 hours, or when Checker detects a change on one of the user's URLs.
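The caching behaviour described in the note can be sketched as follows. To stay self-contained, the sketch uses an in-memory map as a stand-in for the App Engine Memcache service. The class name FeedCache, the injected clock, and the feedBuilder function are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical in-memory stand-in for the Feeder's use of Memcache. */
public class FeedCache {
    private static final long TTL_MILLIS = 24L * 60 * 60 * 1000; // cached feed expires after 24 hours

    private static final class Entry {
        final String xml;
        final long expiresAtMillis;

        Entry(String xml, long expiresAtMillis) {
            this.xml = xml;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final LongSupplier clock;                   // supplies the current time in milliseconds
    private final Function<String, String> feedBuilder; // builds the Atom XML for a user id

    public FeedCache(LongSupplier clock, Function<String, String> feedBuilder) {
        this.clock = clock;
        this.feedBuilder = feedBuilder;
    }

    /** Returns the cached feed if still valid; otherwise builds it and caches it. */
    public synchronized String getFeed(String userId) {
        long now = clock.getAsLong();
        Entry entry = entries.get(userId);
        if (entry != null && now < entry.expiresAtMillis) {
            return entry.xml;
        }
        String xml = feedBuilder.apply(userId);
        entries.put(userId, new Entry(xml, now + TTL_MILLIS));
        return xml;
    }

    /** Called when the Checker detects a change on one of the user's URLs. */
    public synchronized void invalidate(String userId) {
        entries.remove(userId);
    }
}
```

In this sketch the Feeder would call getFeed for each request, and the Checker would call invalidate after storing a detected change, so the next request rebuilds the feed from the Datastore.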














