internal-page-background-header.png

Enterprise Search is critical for Information Management

Continuing the theme of projects that can change the way your company works, in this post we'll look at the topic of Information Findability.

Information Findability is determined primarily by two factors:

  1. The quality of information layout in categories, folders, cabinets, and similar structures; and
  2. The capability of the search engine.

The quality of the user interface could be a third factor, or you can see it as part of the information layout. What's behind the first factor is the intuitiveness of navigation, or, more simply stated - how obvious is it that the information you seek will be here rather than there? The problem with information layout is that it is virtually impossible for one structure to suit all needs. Take a trivial example such as a PO to procure a piece of test equipment for a project. Where does the PO document belong? In Finance or in R&D? In the specific project folder? The correct answer is: in all of them.

Increasingly, modern document management systems are realising the value of the "virtual folder". This presents a list of documents that is relevant to where the user is now in their navigation. The list is produced dynamically (from tags) and is most easily displayed with web-based technology where pages are commonly dynamic in nature anyway. This is what end-users expect from Web 2.0 systems and don't get from network file shares. There is a trade-off here because users don't want *too much* dynamism, and will expect repeatable results when they navigate to the same place. It has to be both familiar and contain the relevant documents. A case of "don't make me think" in action.

The capability of the search engine is down to three attributes: Faster, Deeper, and Wider.

Let's start with fastness. Search engines work by converting all file formats they can read into a single format called an index. It gathers together unstructured data from many diverse file formats, and provides a very rapid retrieval of search results compared to, for example, a lookup in an SQL database. The data stored is trimmed of common, stem, or stop words, to make it more efficient.

The process of creating the index is done by crawling the content, usually automated to start at a specified time or time intervals. It could also be event-driven i.e. add to index every time new content is added. The methods differ in the system resources required and the 'right' answer depends on how real-time the data needs to be in order to be valuable. Most document systems incrementally update the index every 15 minutes or so because that balances a reasonable number of documents being added against the performance impact of the indexing process.

Let's move to deepness. A deep search is one where the most statistically relevant result is obtained. Imagine a system that only supported search by a limited data model (such as a fixed taxonomy or ontology). It's unlikely to be useful. At a minimum, we should be looking for full-text search capabilities. It must not be the case that an end-user has to manually tag content with metadata in order for it to be found. If a document contains the relevant words, it should  be found. Whether it is displayed to a user may depend on the user's right to see the document ("security trimming"), but it should be found. Another important factor for depth is having flexibility in what files types can be indexed. The more file type formats that can be indexed, the better. It's also essential to be able to add custom formats.

Finally, there is wideness. A wide search is one where you are not limited to a single source (or information silo) but instead content is indexed from a set of sources such as other content repositories, business systems, and network file shares. This is often called federated search. The value is fairly obvious, for example as well as retrieving product part numbers from one system, it may be beneficial to combine that with order information from an ERP, or fault reports from a Help Desk system. The problem with combining data from heterogeneous sources is that success depends on data formats and the feasibility of a search API. Every company has a different set of business systems, which doesn't help either. Typically, your mileage with the wideness factor is related to the quality of the systems integration / search consultants that you use; and with the openness or otherwise of the tools that they use.

A software product such as CogniDox can significantly impact fastness and deepness, and have some impact on wideness.

One approach is to treat Search as an add-on, and integrate with leading proprietary software search platforms. But, leading proprietary products such as FAST, Autonomy, and Endeca have been acquired and merged into product lines, making the situation uncertain for their standalone customers. And, as the influence of these proprietary solutions diminished, the Apache Solr® open source solution has grown in strength as a powerful, scalable, cross-platform search engine (https://lucene.apache.org/solr/).

Therefore, CogniDox provides built-in search powered by the Solr engine. Solr has a rich set of features such as faceted search, full text search, rich document handling and dynamic clustering. Out-of-the-box, it provides indexed search for CogniDox documents (including full text search). The Apache Tika project, which is commonly used alongside Solr, has an extensive list of supported Document Formats (http://tika.apache.org/1.5/formats.html).

It basically provides us with fastness and deepness; and by virtue of the fact that it is open source, it leaves the way clear for any wideness initiative.

Value of a DMS for product development

Tags: Document Control, Enterprise Search, Findability