“Just make it work like Google” is a common refrain heard when doing requirements gathering during enterprise search strategy and implementation engagements.
Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide. Google is one of those products that have become a verb. When searching in Google, you can type in a few letters or a few words and, with few exceptions, whatever you were looking for appears at the top of the page within a few hundred milliseconds.
That kind of performance has made it one of the most successful products in history and, importantly, it has set the bar for every application with a search feature. The tempting conclusion is that if Google can index over 50 trillion pages and handle 200 billion searches per month, indexing a hundred million documents in an enterprise should be a simple task. In the remainder of the article, we’ll lay out why this assumption is inaccurate and why enterprise search adds many layers of complexity, and we’ll visit some architectures and strategies that can be used to handle this added complexity.
Data and Metadata curation
Google has thousands of engineers and data scientists working day in and day out to make searches faster, results more relevant, and the user experience more compelling, but their efforts pale in comparison with those of the millions of people who continuously optimize their sites to make sure they come out at the top of the results.
Site administrators endlessly do SEO on their sites to optimize them for the Google index, including adding titles, descriptions, keywords, metadata and relevant URL names. In the enterprise, users often don’t know that titles, keywords and file names have any impact on how easily their documents can be found.
More and more, Google is getting good at matching quality content with the keywords, so the incentive is to continuously provide quality relevant content.
Homogeneity of content on the web
One of the reasons that Google makes things look easy is the number of web pages in its index and the number of searches it can handle simultaneously. Conservative estimates put these at 50 trillion pages and 200 billion searches per month. Most of these searches are accomplished under a few hundred milliseconds. However, many of these pages are text and more specifically HTML.
In contrast, the variability of enterprise data sources can be vast. The following diagram lists a few of the data source types that enterprise search is expected to handle:
Many of today’s search offerings can handle these feeds but not without some configuration and/or customization work.
Enhancing document relevancy based on document relations
Google’s relevance algorithm relies heavily on linkages between pages. If a page has many inbound links, it is considered an important page because many sources are referencing it. Google calls this a page’s PageRank. Most enterprise documents and pages are light on linkages, so the same concept is hard to implement. A metric that is used more often in the enterprise is the number of times a page is visited. A star system can also be implemented, letting users assign a number of stars to indicate whether the page was useful in answering their question.
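To make the link-based idea concrete, here is a minimal sketch of the textbook PageRank formulation computed by power iteration. It is not Google's production ranking, which uses many more signals. The page names, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical intranet pages; "handbook" has the most inbound links.
links = {
    "handbook": ["benefits"],
    "benefits": ["handbook"],
    "onboarding": ["handbook", "benefits"],
    "memo": ["handbook"],
}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
print(top_page)
```

In this toy graph the page with the most inbound links ends up with the highest score. In a typical enterprise corpus, most documents would have no inbound links at all, which is why visit counts or user ratings are often used instead.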
Veracity and validity of the data
The use case when doing an internet search is usually not a “make or break” question. For example, if my query is “How many seats are there in Lambeau Field” and the page with the answer contains the number of seats before the latest stadium renovation, I probably won’t lose my job if I get a stale answer. In an enterprise environment, a precise and fresh answer might be much more important. For example, if we are submitting clinical trial information to the FDA, we had better be sending the most up-to-date information with our application. If we don’t, that might be a billion-dollar mistake.
Locating documents across a variety of content types using faceted search
Faceted search relies on categorical information attached to documents. A facet is an attribute together with the values it can take. For instance, the facet named “time of day” might have the values “morning”, “afternoon”, and “evening.”
By using faceted search, you can retrieve summary information to help you refine a query and “drill down” into your results by refining your search. Faceted search is common in shopping sites. A site user can set filters to narrow down the products that they want to see.
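The summary-and-drill-down behavior described above can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular search product implements facets. The documents and the facet names `type` and `department` are hypothetical.

```python
from collections import Counter

# Hypothetical documents, each tagged with facet attributes.
documents = [
    {"title": "Q3 sales deck", "type": "presentation", "department": "sales"},
    {"title": "Q3 forecast", "type": "spreadsheet", "department": "sales"},
    {"title": "Hiring plan", "type": "spreadsheet", "department": "hr"},
    {"title": "Benefits overview", "type": "presentation", "department": "hr"},
    {"title": "Sales playbook", "type": "document", "department": "sales"},
]


def facet_counts(docs, attribute):
    """Count how many documents carry each value of one facet attribute."""
    return Counter(doc[attribute] for doc in docs if attribute in doc)


def drill_down(docs, **selected):
    """Keep only documents matching every selected facet value."""
    return [
        doc for doc in docs
        if all(doc.get(attr) == value for attr, value in selected.items())
    ]


# Summary information shown next to the results.
department_counts = facet_counts(documents, "department")

# The user clicks "sales"; the remaining facets are recounted.
sales_docs = drill_down(documents, department="sales")
sales_type_counts = facet_counts(sales_docs, "type")
print(dict(department_counts), dict(sales_type_counts))
```

Recounting the remaining facets after each selection is what lets a user keep narrowing the result set without ever reaching an empty page.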
Faceted search is a common feature in enterprise search. It gives users a flexible way to find information, ranging from simple fact retrieval to complex exploratory search and discovery-oriented problem-solving. Combined with keyword search, it becomes incredibly powerful, so much so that faceted search is now the dominant interaction paradigm in many enterprise search solutions.
Successful implementation of faceted search presents many challenges due to the variety of content types, complex taxonomies, disparate user groups and roles, available screen real estate, and user interface requirements.
Google’s consumer advanced search is very limited and not customizable. Most enterprise search use cases require advanced, dynamic faceted search.
When people search in Google, security is non-existent. Everyone has the same wide-open access. In the enterprise context, security is paramount. Much of the information needs to be locked down according to the user’s roles. This adds a layer of technological as well as administrative complexity. And again, this is one of those issues where a single mistake could be very costly.
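Role-based lockdown is often applied as a filtering step on search results, sometimes called security trimming. The sketch below illustrates the idea under simple assumptions: each document carries a set of allowed roles, and a user sees a document only if they hold at least one of them. The `Document` class, the role names and the titles are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    allowed_roles: frozenset


def security_trim(results, user_roles):
    """Return only the results the user is entitled to see."""
    user_roles = set(user_roles)
    # A document with no allowed roles matches nobody (deny by default).
    return [doc for doc in results if doc.allowed_roles & user_roles]


results = [
    Document("Holiday calendar", frozenset({"employee"})),
    Document("Salary bands", frozenset({"hr"})),
    Document("Clinical trial draft", frozenset({"clinical", "regulatory"})),
]
visible = [doc.title for doc in security_trim(results, ["employee", "clinical"])]
print(visible)
```

Real deployments usually have to mirror the permissions of each source system and keep them in sync, which is where much of the administrative complexity comes from.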
There are three distinct steps in implementing a search solution:
- The deployment of the search tool and technologies.
- The population of the search corpus (adding documents to be searched).
- The curation of the data that will be searched (adding links, metadata, and access optimization).
The first step is probably the simplest. The third step is probably the hardest and the most critical for enabling relevant searches. Once the first and second steps are completed, a user will be able to search and get a complete list of matching documents, but the results will not be ranked by relevance. That only becomes possible once the third step is completed. Google has had a 20-year head start on this curation process. Imagine if all the pages on the internet were a day old. It would take quite a while for different sites to determine which pages are important and how they should link to them. In the meantime, internet search results would be complete but not sorted by relevance until these connections were established. Relevance sorting was the main reason that Google was able to overtake early search engine contenders like AltaVista, Ask Jeeves and Yahoo.
All is not lost
We have spent the first half of the article laying out the challenges of implementing enterprise search and why in many cases it’s more difficult than a general internet search, even though the volume of documents is much smaller. We’ll spend the second half of the article analyzing how we can at least partially overcome some of these challenges. We will also visit strategies and architectures that will increase the likelihood of your implementation’s success.
Most of the leading enterprise search tools like Coveo, Sinequa, Lucidworks, and Microsoft offer advanced implementations of faceted search. For example, Coveo has a simple drag and drop interface that allows you to add faceted search with a few simple steps that do not involve coding.
As we mentioned in the first section, there is normally no monetary incentive for most document creators in the enterprise to optimize the documents to be searched and accessed. To combat this, many of the leading search vendors have added powerful AI capabilities to analyze user searches and navigation patterns to learn what documents should be most relevant for a given search pattern. Some examples:
Classification – Another area of search where machine learning is being applied is the creation of ontologies and taxonomies. One example is the query intent classifier functionality provided by Lucidworks. Some tools can generate classifications by example using supervised learning. For example, in bioinformatics, some tools can classify proteins according to their structures and/or sequences. In medicine, classification can be used to predict the type of a tumor to determine whether it’s harmful. Marketers can also use classification-by-example algorithms to predict whether customers will respond to a promotional campaign by analyzing how they reacted to similar campaigns in the past.
Recommendation engines – Another use case combines several basic algorithms into a recommendation engine that proposes content that might be of interest to users. One approach is content-based recommendation, which offers personalized recommendations by matching users’ interests with the descriptions and attributes of documents.
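As a rough illustration of content-based recommendation, the sketch below scores each document by the cosine similarity between its terms and a user profile. It assumes the profile is simply a bag of terms taken from documents the user opened recently. The document titles, terms and the `recommend` helper are hypothetical, and commercial engines use far richer signals.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_profile, documents, top_n=2):
    """Rank documents by similarity to the user's interest terms."""
    profile = Counter(user_profile)
    scored = [
        (cosine_similarity(profile, Counter(terms)), title)
        for title, terms in documents.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Documents sharing no terms with the profile are never recommended.
    return [title for score, title in scored[:top_n] if score > 0]


# Hypothetical documents described by their attribute terms.
documents = {
    "Oncology trial protocol": ["oncology", "trial", "protocol"],
    "Travel policy": ["travel", "expenses", "policy"],
    "Trial data standards": ["trial", "data", "standards"],
}
user_profile = ["oncology", "oncology", "trial", "protocol"]
recommendations = recommend(user_profile, documents)
print(recommendations)
```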
Natural Language Processing and Understanding
Additionally, most vendors implement advanced NLP and NLU to allow for fuzzy and semantic matching. As an example, a search for the keyword sheep might return documents that contain the word ovine.
Entity extraction is an information extraction technique that identifies key elements in a document and classifies them into pre-defined categories. It helps transform unstructured data into structured data. One way to think about entity extraction is as the automated process of tagging a document to make it more easily accessible. Once entities are identified, they can be used in a faceted search or as search keywords. Many of today’s tools support entity extraction either directly or via extensions and plug-ins. Some of these extensions go beyond generic entities and are industry-specific. An example is SciBite, which offers entity extraction for the life sciences domain.
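The simplest form of entity extraction is a dictionary lookup, sketched below. It assumes a small hand-built term list per category. Commercial tools typically rely on curated ontologies and trained models instead. The `GAZETTEER` contents and the category names are hypothetical.

```python
import re

# Hypothetical gazetteer: category -> known terms.
GAZETTEER = {
    "DRUG": ["aspirin", "ibuprofen"],
    "ORGANIZATION": ["FDA", "EMA"],
    "DISEASE": ["migraine", "arthritis"],
}


def extract_entities(text):
    """Return sorted (category, matched text) pairs found in the text."""
    found = set()
    for category, terms in GAZETTEER.items():
        for term in terms:
            # Word boundaries avoid matching a term inside a longer word.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.add((category, match.group(0)))
    return sorted(found)


text = "The FDA reviewed a study of ibuprofen for migraine relief."
entities = extract_entities(text)
print(entities)
```

The extracted pairs can then be stored as document metadata, where each category becomes a facet and each matched term a facet value.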
Obviously, security is a paramount concern in enterprise search implementations. All search vendors realize this and have built security into their implementations from the ground up. This does not mean that it is a simple problem to resolve. For example, at the time of writing, even the latest version of AWS Lake Formation did not yet support row-level security. I don’t have any insight into their internal implementation, but I suspect this feature was missing because of the difficulty of implementing it at scale. It is not unusual for some files in data lakes to contain hundreds of millions or even billions of rows. Implementing row-level security naively for a file like this requires a table whose number of entries is the number of rows in the file multiplied by the number of roles in the security model.
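The rows-times-roles arithmetic grows quickly. The sketch below works through one example under stated assumptions: a naive design that stores one access flag per (row, role) pair, one byte per flag, a file with one billion rows and a model with 50 roles. All of these figures are illustrative and do not describe any vendor's implementation.

```python
def acl_entries(row_count, role_count):
    """Entries needed if every (row, role) pair stores one access flag."""
    return row_count * role_count


def acl_size_gib(row_count, role_count, bytes_per_entry=1):
    """Approximate size of that table in GiB (bytes_per_entry is assumed)."""
    return acl_entries(row_count, role_count) * bytes_per_entry / 2**30


# Assumed figures: one billion rows and 50 roles.
entries = acl_entries(1_000_000_000, 50)
size = acl_size_gib(1_000_000_000, 50)
print(f"{entries:,} entries, about {size:.1f} GiB")
```

Under these assumptions, a single file needs 50 billion entries before any indexing overhead, which suggests why more compact approaches such as filter rules or predicates are attractive at this scale.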
Department level and “Siloed” Search May Matter More
Counterintuitively, some enterprise customers have chosen not to build a single-repository enterprise search and are instead focusing on multiple repositories based on individual use cases. Under this architecture, searches for information in the HR repository do not include marketing materials. One of the underrated victories of the past few years has been the improvement in intra-application and intra-department search.
As enterprises consider the benefits of bringing high-quality search into their environment, it’s important to evaluate the factors that influence the results they seek. Well-implemented search increases employee productivity by making the right information available quickly and easily. Relevant information can also speed up project or product time-to-market and measurably increase the efficiency with which employees work. A low-maintenance, easy-to-use solution lets the enterprise get more value from its business information while keeping costs down. With these factors in mind, organizations that take an informed approach to enterprise search can create a substantial and ongoing competitive advantage and achieve measurable returns.
Implementing Google-like enterprise search is a lot harder than it appears, and in practice enterprises also need to meet requirements that go beyond “simple” Google-like search. Today’s lineup of search vendors offers a wide variety of features and capabilities to satisfy these requirements, but using these tools to their full potential still requires highly skilled individuals.
Special thanks to my colleagues Akhil Raj and Hari Dosapati for their invaluable feedback on this piece.