At Altitude, we have been using Umbraco CMS for quite a while now, and it has been a great experience developing stunning Umbraco-driven websites in a short span of time!
In the past, we have used Umbraco Examine for full-text searches. Umbraco Examine is an implementation of Lucene.Net that uses Umbraco as the data source for its indexes. Examine makes it super easy to work with Lucene indexes.
Recently we implemented an Examine.PDF search feature within one of our Umbraco-based web applications. The requirement was to search for specific words within the PDF files residing in the Umbraco Media section and list the nodes that have those PDF files attached, ordered by relevance. Initially the task sounded easy. We quickly wrote a small .NET application using the UmbracoExamine.PDF NuGet package as a proof of concept. The results were reasonably satisfactory, as the application seemed to address our client’s high-level requirements. We then estimated the PDF search story and started development.
After coding the PDF search feature and performing high-level functional testing, we started testing various real-world scenarios with real data loaded into our database. During this testing phase we observed that the result set (of Umbraco nodes) returned by the Examine.PDF searcher was not consistent. The scores of the resulting nodes changed intermittently for reasons we could not immediately identify.
For example, when we searched for the word “overweight” across all the PDF files residing in our Umbraco Media section, we expected to get a list of PDF nodes containing that specific word. We also expected the scores assigned to the resulting nodes to be based on relevance, meaning that if the word “overweight” appears 10 times in a PDF, that PDF (node) should have a higher score than a PDF (node) with only 7 occurrences. However, this was not the case. A 150 KB PDF with 5 occurrences of the search term had a higher score than a 300 KB PDF with 10 occurrences. We also noticed that the scores of some nodes changed when we added new PDF files to the Media section or updated the contents of existing PDF files. We were quite confused by this, so we decided to take a deeper dive into the underlying Lucene indexing and its scoring mechanisms to understand them better.
Lucene (Apache Lucene) is a high-performance, full-featured text search engine library which can be used in any application that requires a full-text search feature. We started investigating the Lucene scoring mechanism and came across some documentation (http://www.lucenetutorial.com/advanced-topics/scoring.html) which gave us a fair idea of the factors that make up the score, why the resulting node scores were not consistent across our test scenarios, and the ways to customise the scoring logic.
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a user's query. In general, the idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Lucene uses the Boolean model to first narrow down the documents that need to be scored, based on the use of boolean logic in the Query specification. Lucene also adds some further capabilities and refinements on top of this model to support boolean and fuzzy searching, but it essentially remains a VSM-based system at heart.
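The two-stage idea described above (Boolean narrowing first, then VSM-style ranking) can be sketched in a few lines of Python. This is only an illustration, not how Lucene or Examine is implemented: the mini-corpus, the file names and the function names are all made up, and the weight used is a simplified term-frequency times inverse-document-frequency product.

```python
import math

# Hypothetical mini-corpus: each "PDF" is reduced to a list of lowercase tokens.
docs = {
    "a.pdf": "overweight children and overweight adults".split(),
    "b.pdf": "diet advice for adults".split(),
    "c.pdf": "overweight pets".split(),
}

def boolean_filter(docs, required_terms):
    """Boolean model: keep only documents that contain every required term."""
    return {name: tokens for name, tokens in docs.items()
            if all(term in tokens for term in required_terms)}

def vsm_score(tokens, query_terms, docs):
    """VSM flavour: weight in-document frequency by rarity across the collection."""
    score = 0.0
    for term in query_terms:
        doc_freq = sum(1 for t in docs.values() if term in t)
        idf = math.log(len(docs) / (doc_freq + 1)) + 1
        score += tokens.count(term) * idf
    return score

def search(docs, query_terms):
    """Filter with the Boolean model, then rank the survivors by score."""
    candidates = boolean_filter(docs, query_terms)
    return sorted(candidates,
                  key=lambda name: vsm_score(candidates[name], query_terms, docs),
                  reverse=True)

print(search(docs, ["overweight"]))
```

Only documents that pass the Boolean stage are ever scored, which is why a PDF that does not contain the search term never shows up in the results, however it would have scored.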
The factors involved in Lucene's scoring algorithm are as follows:
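1. tf = term frequency in document = measure of how often a term appears in the document
Implementation: sqrt(freq)
Implication: the more frequently a term occurs in a document, the greater its score
Rationale: documents which contain more of a term are generally more relevant
2. idf = inverse document frequency = measure of how often the term appears across the index
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones
3. coord = number of terms in the query that were found in the document
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more
5. queryNorm = normalization factor so that queries can be compared
6. boost (index) = boost of the field at index-time
7. boost (query) = boost of the field at query-time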
The implications and rationales of factors 1, 2, 3 and 4 show up directly in the scores of the results; the boosts (factors 6 and 7) only have an effect if they are explicitly specified.
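To see how these factors can make a small PDF with fewer hits outrank a larger one with more hits, here is a rough Python sketch. It assumes the classic Lucene defaults of sqrt(freq) for term frequency and 1/sqrt(numTerms) for length normalization, and it squares idf as the classic scoring function does. It leaves out coord, queryNorm, boosts and the lossy way Lucene stores norms, so the absolute values will not match real Lucene scores. The function names and all term and document counts are made up; note that file size in KB is not the same thing as the number of indexed terms.

```python
import math

def tf(freq):
    """Term frequency factor: grows with the square root of the hit count."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms across the index weigh more."""
    return math.log(num_docs / (doc_freq + 1)) + 1

def length_norm(num_terms):
    """Length normalization: hits in shorter fields count for more."""
    return 1 / math.sqrt(num_terms)

def term_score(freq, num_terms, num_docs, doc_freq):
    """Simplified single-term score: tf * idf^2 * lengthNorm."""
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * length_norm(num_terms)

# Hypothetical index: 100 PDFs, 20 of which contain the search term.
small_pdf = term_score(freq=5, num_terms=2000, num_docs=100, doc_freq=20)
large_pdf = term_score(freq=10, num_terms=8000, num_docs=100, doc_freq=20)

print(f"small PDF, 5 hits in 2000 terms:  {small_pdf:.4f}")
print(f"large PDF, 10 hits in 8000 terms: {large_pdf:.4f}")
```

With these assumed numbers the smaller PDF wins, because its 5 hits make up a larger share of its text than the 10 hits do in the larger PDF. Doubling the hit count only raises tf by a factor of sqrt(2), while quadrupling the field length halves lengthNorm.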
After considering the factors mentioned above, and the test data present in our database at that point in time, we started to understand the rationale behind the node scores. Most of the scores closely matched the effect of the above-mentioned factors. Of course, we are still not entirely sure about the scores of some nodes for specific search scenarios, but we understand that Lucene adds some additional functionality and refinements to the search results, which can affect the score. Also, the scoring is very much dependent on the way documents are indexed in the first place.
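The idf formula also offers a plausible explanation for scores moving when PDFs are added to the Media section, even for nodes that were not touched. The short Python sketch below uses made-up document counts and only evaluates log(numDocs/(docFreq+1)) + 1; it is an illustration of the formula, not a trace of what our index actually did.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency as log(numDocs/(docFreq+1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1

# Hypothetical starting point: 100 PDFs, 20 of them contain the search term.
before = idf(num_docs=100, doc_freq=20)

# 50 new PDFs are added and none of them contains the term: the term is now rarer.
after_unrelated = idf(num_docs=150, doc_freq=20)

# 50 new PDFs are added and all of them contain the term: the term is now more common.
after_related = idf(num_docs=150, doc_freq=70)

print(f"before:                  {before:.3f}")
print(f"after unrelated uploads: {after_unrelated:.3f}")
print(f"after related uploads:   {after_related:.3f}")
```

Because idf is computed over the whole index, every upload changes numDocs and possibly docFreq, so the score of an existing PDF can go up or down without its own content changing.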
Implementing Examine.PDF gave us an opportunity to understand Lucene indexing and its scoring mechanism, and allowed us to experiment with core Lucene queries to boost node scores according to business requirements.
We documented the factors which affect the PDF search result score and communicated them to our business and client teams to manage their expectations of Lucene search.
A more in-depth analysis of Lucene’s indexing/scoring is coming soon.