A few weeks ago, when I heard that Amazon had launched their SimpleDB service, I have to confess that Amazon wasn’t really on my radar as a technology company. Maybe the marketing hadn’t caught up with me, but I pretty much just buy books from Amazon… web services? Come on. When I started poking around, I came to realize that Amazon has some really, really cool services and that I must have been living under a rock.

The service that piqued my interest the most, though, was their Simple Storage Service (S3). S3 is, as the name indicates, a very simple storage service that allows users to store files along with up to 2 KB of metadata in the Amazon cloud (buzz) storage system. Files can be as large as 5 GB and are accessible via SOAP or REST services using either the HTTP or BitTorrent protocols. It’s not a free service, but the cost is very reasonable (they provide an online calculator to estimate your cost… I think you’ll find they are very cheap. Total development cost for this little project was 8 cents).

Great, but what does S3 have to do with Content Management?

S3 is essentially a content repository as a service. Obviously it does not meet all the requirements for a full-blown content management system, but from the things you absolutely need in a content repository (security, metadata, storage) to the things you’d like to have (high availability, high reliability, unlimited storage), it really does make for an excellent repository.

I also really like the concept of being able to store the metadata with the content revision. It’s certainly not a requirement or even a common practice with most CMSs, but it adds another layer of redundancy down the road. I’m getting a little away from the theme of the post here, but the fella over at “Me and Content Management” had a great post on this concept and I have to agree with his thoughts.

So would it be possible to take the robust content repository features of S3 and bundle them with a content management system? The content management system would be the front end, managing user security, contribution, workflows, delivery… all the great things that content management systems do, while S3 would serve as the content repository, leveraging its power.

Enter Oracle UCM File Store Providers

One of the neat features of Oracle UCM 10gR3 (formerly known as Stellent v8) is the new concept of File Store Providers. Earlier versions of Stellent forced a simple yet elegant configuration: the original files were stored in a protected folder structure known as the vault, and the web-accessible versions, which might be just a copy of the original or a multitude of transformed versions, were stored in a “weblayout” folder. This setup works very well, and I am sure it is still employed by most Content Server instances.

One of the problems with this design is that it does not offer many options as far as drive space goes. Each file has at least one copy in the vault and one in the weblayout, so you automatically have to double your space projections. That’s probably not a huge problem for Word and text content, but start using the content server for Digital Asset Management and space requirements definitely rise on the priority list. Enter 10gR3 with file store providers, and you now have a variety of storage options:

  • Content can be stored without a web version
  • Content can be stored in the database
  • Content can be stored half in the database half on the file system

File Store providers are essentially a configurable abstraction layer from the content server to wherever you want to store your content.  The really neat thing about them though is that they appear to be a planned area of extensibility for the content server.

My S3 File Store Provider

A File Store provider for Amazon S3 seemed like such a logical enhancement to the content server that I had to give it a go. Unlike service handlers, filters and all the other goodies in Bex’s book and the How to Components, I know of no best practices for File Store providers, so I’m a little on my own here. Basically, if you use my project as a template, I am not sure I did things the “Oracle way” (or even if there is one).

In setting up the project I tried to figure out how Oracle configured and registered its JDBCStorage class, which is the primary object used in the database provider. My setup consists of a class called SimpleStorage which implements the ExternalStorageImplementor and CommonStoreImplementor interfaces. The file store and its classes are then registered in two tables: a mergeable table located in a component resource called FSStores and a non-mergeable resultset found in the defaultFileStore’s provider.hda file (an install filter in the component handles the edit for you). Most of the methods required by the interfaces are pretty straightforward (store here, move there, etc.), though there are a handful of others that represented many hours of staring at the system audit window (addFilesystemPathInfo for one).

For the actual S3 communication I used and bundled an open source API called JetS3t. JetS3t is an excellent API which I would highly recommend for all your Java/S3 needs. In addition to the API, JetS3t includes a little GUI called Cockpit. I’ve included it with the API in the component so you can use it to take a peek at your S3 buckets. You’ll find it in the lib/bin folder. Side note: Thanks JetS3t team… your API worked great.

From an overview perspective, the provider works like this:

  1. A user checks in a piece of content flagged for S3 storage
  2. The vault content item is immediately stored on S3 along with up to 2 KB of its metadata (a config entry determines the fields to include)
  3. If all content is to be stored in S3, the web version is then also stored on S3, otherwise it is written to the web layout folder
  4. When the indexer attempts to access the vault file, it is retrieved and written out to the vault folder.
  5. When a user views the content information page, or tries to access the web layout rendition, the web layout version is retrieved and saved to the web layout folder
  6. A scheduled task then runs behind and deletes any vault or weblayout files which are stored on S3, but have not been accessed in a preset number of minutes.
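
To make step 6 a bit more concrete, here is a minimal sketch of what a cleanup pass like that could look like. This is not the component's code and it does not touch any Oracle UCM API: the class name LocalCopySweeper and its sweep method are made up for illustration, it uses the file's last-modified time as a stand-in for "last accessed", and it assumes every file under the given folder already has a copy on S3 (a real task would have to check that before deleting anything).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of the component.
final class LocalCopySweeper {
    private LocalCopySweeper() {}

    /**
     * Deletes regular files under root whose last-modified time is older
     * than (now - maxIdle) and returns the paths that were removed.
     * Assumption: everything under root is also stored on S3.
     */
    static List<Path> sweep(Path root, Duration maxIdle, Instant now) throws IOException {
        Instant cutoff = now.minus(maxIdle);

        // Collect the files first so nothing is deleted while the walk is still open.
        List<Path> candidates;
        try (Stream<Path> walk = Files.walk(root)) {
            candidates = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        List<Path> deleted = new ArrayList<>();
        for (Path file : candidates) {
            Instant lastTouched = Files.getLastModifiedTime(file).toInstant();
            if (lastTouched.isBefore(cutoff)) {
                Files.delete(file);
                deleted.add(file);
            }
        }
        return deleted;
    }
}
```

A scheduled task could call something like LocalCopySweeper.sweep(vaultFolder, Duration.ofMinutes(60), Instant.now()) for the vault and again for the web layout folder, where the 60 stands in for the preset number of minutes.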

Content can also be moved to and from S3 by performing an Update and just changing the storage rule.
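
Step 2 in the list above depends on fitting the chosen metadata fields inside S3's metadata limit, so here is a rough sketch of one way to do that selection. It is not how the component does it: the MetadataBudget class and its select method are hypothetical, the field list stands in for whatever the config entry contains, and the size is counted as the UTF-8 bytes of each field name plus its value (any header prefix the REST API adds is ignored here).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the component.
final class MetadataBudget {
    // The 2 KB metadata allowance mentioned in the post.
    static final int LIMIT_BYTES = 2048;

    private MetadataBudget() {}

    /**
     * Walks the configured fields in order and keeps each one that still
     * fits in the remaining budget. Fields that are missing, empty, or too
     * large are skipped, so one oversized field does not block later ones.
     */
    static Map<String, String> select(Map<String, String> docMeta,
                                      List<String> configuredFields,
                                      int limitBytes) {
        Map<String, String> chosen = new LinkedHashMap<>();
        int used = 0;
        for (String field : configuredFields) {
            String value = docMeta.get(field);
            if (value == null || value.isEmpty()) {
                continue;
            }
            int cost = utf8Length(field) + utf8Length(value);
            if (used + cost > limitBytes) {
                continue;
            }
            chosen.put(field, value);
            used += cost;
        }
        return chosen;
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

With a configured list such as dDocName, xComments, dDocTitle (example field names only), a 3,000-character xComments value would be skipped while the two short fields are kept. The resulting map is what you would then attach to the S3 object before uploading it.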

So what good is this thing?

I am pretty excited about this component because I think it’s a concept with some potential. Couple your existing content management system with the ultimate content repository. You get unlimited, enterprise-class, scalable storage at a very reasonable cost without the infrastructure and staffing requirements that go along with it. I like to imagine future content server instances using a limited “working” storage area that holds only the content being actively used, with terabytes of content stored out on S3. No more SANs or the endless list of non-technology related headaches they bring.

I think we could also envision scenarios where content is managed in S3 by a content management system, but then read and accessed directly by other systems, potentially on different sides of the planet. My knee-jerk reaction to that is that it sounds like a bad practice, but if it’s secured properly, why not?

Bandwidth and performance are probably the two things that would need to be watched in this sort of model, but I just don’t think they would prove to be much of a problem for the majority of content items. Large files could potentially pose a performance problem, but how much more noticeable would downloading a 25 MB file be if it had to come down from S3 first? I am not sure most users would notice.
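
For a rough feel of the numbers, the extra wait on a first fetch is roughly the file size divided by the throughput between the Content Server and S3. The sketch below is back-of-the-envelope arithmetic, not a measurement: the TransferEstimate class is hypothetical, the link speeds are assumptions, "25mb" is read as 25 megabytes, and latency and request overhead are ignored.

```java
// Hypothetical helper for illustration only; the link speeds used with it are assumptions.
final class TransferEstimate {
    private TransferEstimate() {}

    /**
     * Seconds needed to move sizeMegabytes over a link that sustains
     * linkMegabitsPerSecond (1 byte = 8 bits, decimal mega on both sides).
     */
    static double seconds(double sizeMegabytes, double linkMegabitsPerSecond) {
        if (linkMegabitsPerSecond <= 0) {
            throw new IllegalArgumentException("link speed must be positive");
        }
        return sizeMegabytes * 8.0 / linkMegabitsPerSecond;
    }
}
```

For a 25 MB file, TransferEstimate.seconds(25, 100) gives 2.0 seconds on a sustained 100 Mbit/s link and TransferEstimate.seconds(25, 10) gives 20.0 seconds on a 10 Mbit/s link, so how noticeable the delay is depends heavily on the connection. It also only applies to the first request, since the retrieved copy stays in the local folders until the cleanup task removes it.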

Something that may only interest me 

S3 is a SOAP/REST-based service… Imagine that, a protocol/service being used to retrieve content? I hope the JSR170/283 fans take notice.

How to get it

The component is available for download here. I should be clear that this is one of the more complex projects I’ve developed for the site and it hasn’t exactly gone through any stringent QA testing (it’s just me). Like all my sample projects, it’s provided without warranty or support, and I recommend you test it out thoroughly on a dev environment first. My development (and testing) environment was a Windows Server 2003 machine with Oracle UCM 10gR3 and an Oracle 11g database, but I am fairly confident that the component will run on just about any environment that the Content Server will. We just might need to switch out some jars.

You may have noticed from the nav that this site now has a forums section. If you do encounter an issue, please feel free to post it to the support forum. I’ll get an email and maybe be able to help.

Here’s the download link:

Amazon S3 File Store Provider for Oracle UCM