In this first part of what I expect to be a few posts about Alfresco Server Receiver (ASR) optimizations, I’ll talk specifically about XML metadata extractors for Alfresco Web Content Management (WCM). So what is an XML metadata extractor, and why should you care about it? Let’s put it in the context of a diagram:
In the diagram above we see that the WCM authoring environment is configured with web forms. These allow business users to enter content into the system, for example an article to be published. To do so, the user does not have to be skilled in web-related technologies such as HTML; they simply fill out a form with the content to be published. Once their content is entered, it is saved as XML, submitted to the staging sandbox, and ultimately deployed, in this case to an ASR. The ASR is shown configured with an XML metadata extractor and a DM content model, which defines the aspects that will be applied to the deployed content. So for an article content type on the authoring side, there would be an article aspect defined in the content model on the ASR. The XML metadata extractor pulls values out of the deployed content (the article) and stores them as properties according to the aspect defined in the DM content model. As a result, the content delivered via the web form can be indexed by Lucene, which makes it possible to search on those properties efficiently at retrieval time.
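To make the extraction step concrete, here is a minimal sketch of the idea in plain Python. It is not Alfresco's extractor or its configuration; the article XML layout, the `article:title` and `article:author` property names, and the `extract_metadata` helper are all assumptions made up for illustration. It simply shows a mapping from property names to paths in the form XML being turned into a set of properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML as a web form might save it for an "article" content type.
ARTICLE_XML = """
<article>
  <title>Quarterly results</title>
  <author>J. Smith</author>
  <body>Revenue grew this quarter.</body>
</article>
"""

# Hypothetical mapping: property on the article aspect -> path in the form XML.
PROPERTY_PATHS = {
    "article:title": "./title",
    "article:author": "./author",
}


def extract_metadata(xml_text, property_paths):
    """Return a dict of property name -> text for every path that matches."""
    root = ET.fromstring(xml_text.strip())
    properties = {}
    for name, path in property_paths.items():
        value = root.findtext(path)  # None when the element is absent
        if value is not None:
            properties[name] = value.strip()
    return properties


if __name__ == "__main__":
    for name, value in extract_metadata(ARTICLE_XML, PROPERTY_PATHS).items():
        print(name, "=", value)
```

Each deployed item has to be parsed and queried in this way, which is the per-item cost the rest of this post is concerned with.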
The problem with this approach is that the ASR is likely serving a live production web site that may have thousands of visitors (or more):
As such, it is less than ideal to have the ASR itself do the processing required to extract metadata from the XML content. Wouldn’t it be better if that were done on the authoring server? You bet it would. Hence the first ASR optimization: perform XML metadata extraction in the authoring environment:
One last note: for assistance setting up XML metadata extraction for WCM, see this page on the Alfresco wiki.