My blog syndicates its new content through an Atom feed so that people can subscribe to updates. And according to the server logs, people use it (that was the initial point of having it; so, thanks!). However, it wasn’t until recently that I actually read the standard and a few other online sources, and discovered that a few things, such as the entry identifiers, have been done wrong all this time. I’m documenting these issues publicly for future reference.
Background: Atom is not RSS
Although you will hear me vaguely referring to my feed as “the RSS”, Atom and RSS are not exactly the same thing. RSS was the first protocol that defined how websites should expose modifications, and how clients should fetch and treat those modifications. However, because RSS wasn’t extensible enough, and the standard was designed to prevent changes that could disrupt existing implementations, a working group started working on a new standard named Atom.
There are a few main differences you can see on this page that I’m not going to cover, but the notable differences for the final feed document are: Atom elements live in an XML namespace, while RSS elements do not; Atom is an IETF standard (RFC 4287), RSS is not; XML tags and attributes that have similar purposes have different names in the two protocols; and there are changes in what is required and what is not.
Atom feeds are XML documents and thus should be valid
When you craft an Atom feed template (such as my Jekyll template for generating the Atom feed), make sure you produce valid documents. Validate your Atom feed using the W3C Validator (yes, there is a validator for this).
Clients following the robustness principle should easily forgive malformed feeds, but you will make things easier if you produce well-formed feeds in the first place. Take a look at the IETF standard (it is not very long). Check which tags are valid and which ones are not, and don’t use tags in a way that clients may not expect.
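As a rough local sanity check before reaching for a full validator, something like the following Python sketch can catch the most basic mistakes. It only tests that the document is well-formed XML and that the feed and each entry carry the `id`, `title` and `updated` elements the standard requires; it is not a replacement for a real validator. The sample feed and the `check_feed` name are made up for this illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
REQUIRED = ("id", "title", "updated")

# Hypothetical sample feed, used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <id>urn:uuid:60a76c80-d399-11d9-b93c-0003939e0af6</id>
  <updated>2016-01-01T12:00:00Z</updated>
  <entry>
    <title>First post</title>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2016-01-01T12:00:00Z</updated>
  </entry>
</feed>"""


def check_feed(xml_text):
    """Return a list of problems; an empty list means these basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The parser refuses documents that are not well-formed XML.
        return [f"not well-formed XML: {exc}"]

    if root.tag != ATOM + "feed":
        return ["root element is not <feed> in the Atom namespace"]

    problems = []
    # Only a subset of the standard's rules is checked here.
    for name in REQUIRED:
        if root.find(ATOM + name) is None:
            problems.append(f"feed is missing <{name}>")
    for index, entry in enumerate(root.findall(ATOM + "entry")):
        for name in REQUIRED:
            if entry.find(ATOM + name) is None:
                problems.append(f"entry {index} is missing <{name}>")
    return problems


print(check_feed(SAMPLE_FEED))
```

Running a check like this as part of the site build is cheap, but the online validator knows far more rules than this sketch does.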
Don’t ever change an ID
Atom uses the <id> tag to identify unique entries in a feed, and in a similar way, RSS uses the <guid> tag for the same purpose. Feed readers and other applications digesting feeds use the ID to tell each unique content unit apart. You are not supposed to change the ID of an existing item, because that will make clients treat the item as new, even though users may already have been notified about that item.
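To make the two identifier tags concrete, here is a small Python sketch that pulls the item identifiers out of either kind of feed. The sample documents, the `urn:example:` identifiers and the `item_ids` helper are all invented for this example; real feed parsers handle many more cases.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical minimal documents, trimmed down to the identifier tags.
ATOM_SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><id>urn:example:post-1</id></entry>"
    "</feed>"
)
RSS_SAMPLE = (
    '<rss version="2.0"><channel>'
    "<item><guid>urn:example:post-1</guid></item>"
    "</channel></rss>"
)


def item_ids(xml_text):
    """Return the identifier of every item in an Atom or RSS document."""
    root = ET.fromstring(xml_text)
    if root.tag == ATOM + "feed":
        # Atom: namespaced <entry> elements, each with an <id> child.
        return [entry.findtext(ATOM + "id") for entry in root.findall(ATOM + "entry")]
    if root.tag == "rss":
        # RSS: un-namespaced <item> elements inside <channel>, each with a <guid>.
        return [item.findtext("guid") for item in root.findall("channel/item")]
    raise ValueError(f"unrecognized feed root: {root.tag}")


print(item_ids(ATOM_SAMPLE))
print(item_ids(RSS_SAMPLE))
```

Whatever string comes out of this is the only thing a reader has to recognize an item it has already seen.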
Don’t use a permalink as ID
Permalinks, as the name implies, should be permanent links. They should not change. They should not expire. You should be able to visit a permalink you bookmarked 5, 10 or 20 years ago and still see the same document you saw back then (unless the site went out of business). At least that’s what Tim Berners-Lee’s Cool URIs Don’t Change guide says.
The thing is that, even though they are not supposed to, permalinks sometimes change. Documents move. Websites get redesigned and directory hierarchies are remade. The more information you put in the URL (author, subject, status…), the higher the risk that it will change at some point.
HTTP 301 and other redirect methods reduce the guilt (although they don’t pardon it). However, you can’t redirect an <id> tag in a feed. So, every time a permalink structure changes, the feed breaks: the IDs or GUIDs change with it, and duplicate items appear in readers. There are a few ways to fix this:
- Use a non-pretty permalink as the ID, such as one based on the primary key. This is what WordPress does. Instead of using the pretty permalink, which can change if you edit the permalink structure in the WordPress settings, it uses the post ID, as in http://example.com/?p=1234. The drawback is that if you change your blogging software and your primary keys change, your IDs will change too.
- Use something that is not the permalink. Use a UUID (also known as GUID by the Microsoft people), or any other unique value, and store it as another piece of metadata for your content, just like the published date or the author. You can change a permalink, but there is no reason to change a UUID because it’s meaningless: it’s just a sequence of hexadecimal digits and dashes.
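The second option can be sketched in a few lines of Python. The idea is to generate a random UUID once, store it with the rest of the post metadata, and never regenerate it. The `ensure_id` helper and the dictionary standing in for a post’s metadata are assumptions made for this example, not part of any blogging tool.

```python
import uuid


def new_entry_id():
    """Return a fresh random identifier in URN form (urn:uuid:...)."""
    return uuid.uuid4().urn


def ensure_id(metadata):
    """Assign an ID the first time only; never overwrite an existing one."""
    if not metadata.get("id"):
        metadata["id"] = new_entry_id()
    return metadata


# Hypothetical post metadata, as it could appear in a post's front matter.
post = {"title": "Hello", "permalink": "http://example.com/2016/hello/"}
ensure_id(post)
print(post["id"])
```

Because the ID is stored rather than derived, changing the permalink later has no effect on it.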
But if you want to change your ID policy (you discovered permalinks are a bad ID and you want to set up a new system), please keep the older IDs for already published content and keep using them, otherwise you will break your feed again while attempting to fix it! If your feed limits the number of exposed items, you could wait until those old posts drop out of the feed before deleting the legacy ID and generating the new ID for that item.
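One way to picture such a migration is the Python sketch below. It assumes the old policy used the permalink as the ID, and that each post is a dictionary with hypothetical `permalink`, `uuid`, `published` and `legacy_id` keys; none of these names come from a real tool.

```python
def freeze_legacy_ids(posts):
    """One-off migration: record the ID each published post already exposes."""
    for post in posts:
        # Assumption: under the old policy the permalink was used as the ID.
        if post.get("published") and "legacy_id" not in post:
            post["legacy_id"] = post["permalink"]


def atom_id(post):
    """Prefer the frozen legacy ID; fall back to the new UUID-based ID."""
    return post.get("legacy_id") or "urn:uuid:" + post["uuid"]
```

Run `freeze_legacy_ids` once, before touching the permalink structure. After that, `atom_id` keeps returning the old value for old posts no matter where they move, while posts published later get the new style of ID.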
Content updates: can they be trusted?
I have used a few feed clients to test and debug my feed before uploading my site changes. If a feed reader marked an already read post as new, that would be a regression. The good thing is that I succeeded on all the clients I tested. No IDs have changed and no duplicated items appear.
However, I tested something else: how do these feed readers react whenever you update an item? The Atom standard supports the concept of updating an item. Items have a <published> tag and an <updated> tag. The standard describes them as follows:
Typically, atom:published will be associated with the initial creation or first availability of the resource.
The “atom:updated” element is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant.
Not every update made to a post constitutes a significant change. If I fix a typo in the title of a post, that’s not a significant update. If I suddenly update a three-month-old post to reflect that some steps in a tutorial have changed, that may constitute a significant change. Who knows; it’s up to the publisher to mark the change as significant.
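On the publishing side, that decision could look like the following Python sketch: every edit changes the content, but only edits flagged as significant move the `updated` timestamp, and `published` is never touched. The `apply_edit` helper and the entry dictionary are hypothetical names used for illustration.

```python
from datetime import datetime, timezone


def rfc3339(moment):
    """Format a timezone-aware datetime as a UTC timestamp for the feed."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def apply_edit(entry, changes, significant, now=None):
    """Return an edited copy of the entry, bumping updated only when asked to.

    now must be timezone-aware; it defaults to the current time in UTC.
    """
    edited = {**entry, **changes}
    if significant:
        edited["updated"] = rfc3339(now or datetime.now(timezone.utc))
    return edited


# Hypothetical entry, with both timestamps set on the day it was published.
entry = {
    "title": "Installing the thign",
    "published": "2016-01-01T12:00:00Z",
    "updated": "2016-01-01T12:00:00Z",
}
typo_fix = apply_edit(entry, {"title": "Installing the thing"}, significant=False)
print(typo_fix["updated"])
```

Here the typo fix leaves `updated` where it was, so a reader that trusts the timestamp has no reason to flag the post again.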
After experimenting with four RSS clients, I’ve found that they don’t consistently reflect updates to a post. I ran two experiments using Thunderbird, QuiteRSS, Vienna and Shrook. Feedly was not tested because I didn’t want to access a staging/QA server from it.
- I subscribed to the feed using these four clients, and then I made changes to the title and contents of two posts. I also marked one of the amended posts as significantly updated by changing its updated timestamp. Then I rebuilt the feed and refreshed the clients. Only Shrook updated the item titles, summaries and contents. The remaining clients didn’t even update the texts in the item list.
- Then, I unsubscribed from the feed and subscribed again to see how a new subscriber would see an amended post. All the feed readers used the new title and content. However, not all clients presented the same date for the item. Thunderbird and Vienna used the <updated> field for the date, presenting the day I amended the post. QuiteRSS and Shrook used the <published> field for the date, presenting the day I originally published the blog post.
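For reference, this is roughly what a reader that does honour updates might do on each poll. It is a Python sketch of one plausible approach, not a description of how Thunderbird, QuiteRSS, Vienna or Shrook actually work. The `classify` helper, the `urn:example:` identifiers and the storage format are assumptions, and the timestamp parsing only covers the common forms without fractional seconds.

```python
from datetime import datetime


def parse_timestamp(text):
    """Parse a timestamp such as 2016-01-01T12:00:00Z or ...+02:00."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def classify(seen, fetched):
    """Label each fetched entry as "new", "updated" or "unchanged".

    seen maps entry IDs to the <updated> value stored on the previous poll.
    fetched is a list of dicts with "id" and "updated" keys.
    """
    labels = {}
    for entry in fetched:
        previous = seen.get(entry["id"])
        if previous is None:
            # Unknown ID: the reader has no choice but to treat it as new.
            labels[entry["id"]] = "new"
        elif parse_timestamp(entry["updated"]) > parse_timestamp(previous):
            labels[entry["id"]] = "updated"
        else:
            labels[entry["id"]] = "unchanged"
    return labels


# Hypothetical state: one item already seen, then polled again after an edit.
seen = {"urn:example:a": "2016-01-01T12:00:00Z"}
fetched = [{"id": "urn:example:a", "updated": "2016-04-01T09:30:00Z"}]
print(classify(seen, fetched))
```

The sketch also shows why a changed ID produces a duplicate: an entry with an unknown ID always lands in the "new" branch, whatever its timestamps say.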
It’s up to feed designers to decide how item modifications and updates should be syndicated. However, after these experiments I’ve come to the conclusion that feed clients may not respect your update or reflect it in their item lists, so you should not rely 100% on them.
Many blogging systems and other server software may provide you with a well-tested, ready-made Atom or RSS feed that is already aware of this. Many feed readers, such as Feedly, may use a solid implementation, tested against thousands of feeds, that detects and fixes these issues without marking posts as duplicated. However, if you are rolling out your own Atom template, you should be aware of this to avoid future errors.
I’ll make a follow-up post explaining how I rolled out these changes on my blog without causing regressions to existing items in my feed.