Sitecore, Solr, and Many Languages
Sitecore 7 added a content search API to interact with Lucene and Solr. I'm sure anyone who has ever worked with search will tell you that search is hard, as it requires a lot of customisation that is entirely per-site, and what works for someone else might not work for you.
I'm here to tell you what worked for me, a really specific use case involving Sitecore 8.1 update 4, Apache Solr 6.2, and searching 7 regions with 4 different languages.
We're using some internal libraries on top of the content search API, but they eventually make the same calls as everyone else.
We started out with a fairly standard content search, which... mostly worked, even across languages. Condensed form:
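    // Condensed form: the filter has to go through Where() to become part of the query.
    using (var context = SearchIndex.CreateSearchContext())
    {
        var query = context.GetQueryable<oursearchresults>()
            .Where(result => result.Content.Like(queryArgs));
    }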
There are actually a few issues with this approach:
1. The way our site is set up, 90% of the content we care about is actually in an item's components, not on the item itself.
2. This treats all languages the same way. Sitecore will send the same query to Solr no matter the language being searched: _content:(*queryArgs*)
3. This will only give exact matches (even though .Like() is used).
Issue 1 is solved with a computed index field.
    public class VisualizationField : MediaItemContentExtractor
    {
        public override object ComputeFieldValue(IIndexable indexable)
        {
            string baseValue = base.ComputeFieldValue(indexable) as string;
            Item indexItem = indexable as SitecoreIndexableItem;
            if (!ShouldIndexItem(indexItem))
            {
                return baseValue;
            }

            // Find the items this item links to (its component data sources).
            var dataSources = Globals.LinkDatabase
                .GetReferences(indexItem)
                .Where(link => ShouldProcessLink(link, indexItem))
                .Select(link => link.GetTargetItem())
                .Where(targetItem => targetItem != null && targetItem.Versions.Count > 0)
                .Distinct();

            var result = new StringBuilder();
            if (!string.IsNullOrEmpty(baseValue))
            {
                result.AppendLine(baseValue);
            }

            // Append every indexable field of every indexable data source.
            foreach (var dataSource in dataSources.Where(ShouldIndexDataSource))
            {
                dataSource.Fields.ReadAll();
                foreach (var field in dataSource.Fields.Where(ShouldIndexField))
                {
                    result.AppendLine(field.Value);
                }
            }

            return result.ToString();
        }
    }
The ShouldProcess and ShouldIndex methods check whether something is actually related to the item, and whether it should be put into the Solr index, based on some pretty basic parameters (correct content type, whether or not the component is actually being rendered).
Issue 2 caused me a great deal of stress until I stumbled across a blog post from the Sitecore 7 era. Sitecore added the concept of CultureExecutionContexts, which is a really fancy way of saying you can tell Sitecore to send over a search for content_t_{lang} instead of just _content by using this:
    using (var context = SearchIndex.CreateSearchContext())
    {
        // Tell Sitecore which language-specific field to query.
        var culture = new CultureInfo(Sitecore.Context.Language.Name);
        var cultureCtx = new CultureExecutionContext(culture);
        var query = context.GetQueryable<oursearchresults>(cultureCtx)
            .Where(result => result.Content.Like(queryArgs));
    }
And now your Solr queries will look like this:
`content_t_{lang}:(*queryArgs*)`
Huzzah! You're searching specific languages! The problem quickly becomes that you're now doing language-specific exact-match queries, which isn't very helpful.
Enter stemming algorithms.
The basic idea is that you give Solr a word like engineer, and it boils the word down to its stem, so that queries for engineer, engineers, engineered, or engineering all give you the same results. There are stemmers for basically every language you can think of, and the Solr documentation explains how to use them far better than I ever could. The example schema.xml file generated by Sitecore actually contains basic analyzers that work fairly well. You will likely want to tweak them to fit your needs, but for an out-of-the-box solution, they work.
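To make the idea concrete, here is a minimal sketch of an English stemming field type for schema.xml. The field type name text_en_sketch and the *_t_en dynamic field pattern are assumptions for illustration and may not match what Sitecore's generated schema contains. The tokenizer and filter factories are standard Solr classes.

```xml
<!-- Hypothetical field type: tokenize, lowercase, then reduce words to their stem -->
<fieldType name="text_en_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Porter stemmer: engineer, engineers, engineered and engineering share one stem -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed naming: route English text fields to the stemming field type -->
<dynamicField name="*_t_en" type="text_en_sketch" indexed="true" stored="true"/>
```

Because the same analyzer runs at index time and at query time, the stemmed form of the search term is compared against the stemmed form of the content.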
Once you've put the correct analyzers in place, restarted Solr (this is important, Solr does not pick up schema changes on the fly), and reindexed, you should now be getting decent search results in multiple languages.
This is where language specifics come into play. One of the languages this client supports is Polish, which does not come with out-of-the-box support from Solr. Thankfully, there are already instructions for how to set that up.
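For reference, a Polish field type might look roughly like the sketch below. It assumes the Stempel stemmer from Solr's analysis-extras contrib, with its jars already added to the core's classpath. The field type name text_pl_sketch is made up for illustration.

```xml
<!-- Hypothetical Polish field type; needs the analysis-extras (Stempel) jars on the classpath -->
<fieldType name="text_pl_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>
```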
The problem language for us, so far, has been German. German makes heavy use of compound words, which means new words tend to be made by shoving old ones together. For instance, the German word for engineer is "ingenieur" and the word for civil engineer is "bauingenieur." This creates an issue for our search purposes, as a search for "ingenieur" should return content containing either "bauingenieur" or "ingenieur." The problem is solved with the Dictionary Compound Word Token Filter, a Solr filter that will break words like bauingenieur down into their components "bau" and "ingenieur," so your results become what you'd expect. This requires a German word list, which can be a bit tricky to find, but once you have it, it works beautifully.
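A German field type using that filter could be sketched like this. The field type name text_de_sketch, the dictionary file name german-words.txt, and the size limits are assumptions for illustration, not our actual configuration. The word list is expected to be a plain text file with one lowercase word per line, placed in the core's conf directory.

```xml
<!-- Hypothetical German field type with dictionary-based decompounding -->
<fieldType name="text_de_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so tokens can match a lowercase word list -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keeps the original token and adds any dictionary sub-words it contains -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="german-words.txt"
            minWordSize="5"
            minSubwordSize="3"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The minSubwordSize of 3 is chosen here so that a short component like "bau" still qualifies. A smaller value tends to produce more spurious sub-word matches.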
At this point, our search results have become downright useful and accurate (though we haven't implemented nice-to-haves like spellchecking and synonym searches), but there's a subtle bug. Sitecore isn't sending over the _content field to Solr for each individual language properly. If your setup is like ours, with a very thin item and all of the pertinent content in subcomponents, the _content field in the index is going to be very sparse, basically containing nothing but the content in the top-level item itself.
It took several hours of debugging and someone far more versed in Sitecore than me to finally solve, but the issue is in the computed index field for the _content field.
    var dataSources = Globals.LinkDatabase
        .GetReferences(indexItem)
        .Where(link => ShouldProcessLink(link, indexItem))
        .Select(link => link.GetTargetItem())
        .Where(targetItem => targetItem != null && targetItem.Versions.Count > 0)
        .Distinct();
This code will only get the components in Sitecore's default language. The rest of the code will properly put the correct language content from the top-level item into the index, but one of the checks it makes is whether or not a component is in the layout of that item in that version and language. If you have an item that only exists in the default Sitecore language, this works fine, but for any other language it's not going to get any of the subcomponents.
I haven't found any documentation about this, but the solution that is working for us is bringing in a LanguageSwitcher:
    public class VisualizationField : MediaItemContentExtractor
    {
        public override object ComputeFieldValue(IIndexable indexable)
        {
            string baseValue = base.ComputeFieldValue(indexable) as string;
            Item indexItem = indexable as SitecoreIndexableItem;
            if (!ShouldIndexItem(indexItem))
            {
                return baseValue;
            }

            // Resolve links and data sources in the indexed item's language
            // rather than the default language. The LINQ query is lazy, so the
            // switcher has to stay active until the loops below have finished.
            using (new LanguageSwitcher(indexItem.Language))
            {
                var dataSources = Globals.LinkDatabase
                    .GetReferences(indexItem)
                    .Where(link => ShouldProcessLink(link, indexItem))
                    .Select(link => link.GetTargetItem())
                    .Where(targetItem => targetItem != null && targetItem.Versions.Count > 0)
                    .Distinct();

                var result = new StringBuilder();
                if (!string.IsNullOrEmpty(baseValue))
                {
                    result.AppendLine(baseValue);
                }

                foreach (var dataSource in dataSources.Where(ShouldIndexDataSource))
                {
                    dataSource.Fields.ReadAll();
                    foreach (var field in dataSource.Fields.Where(ShouldIndexField))
                    {
                        result.AppendLine(field.Value);
                    }
                }

                return result.ToString();
            }
        }
    }
Once you rebuild and reindex with the proper computed index field, your components will be properly indexed, your search results correct, and, hopefully, your clients happy.