Hierarchical facets with Solr

If you have worked with the TYPO3 search engine "indexed_search" before, you may know its selector that lets the user restrict the search results to certain parts of the page tree. Since SOLR seems to be the way to go for search with TYPO3, here is a short tutorial on how a similar feature can be implemented with the TYPO3 solr extension.

For a comprehensive introduction to implementing hierarchical facets and filters with SOLR, take a look at:

http://wiki.apache.org/solr/HierarchicalFaceting

 

Storing TYPO3 page-tree information in SOLR index

What we want looks like this. Assume we have a page tree like

1
|-2
|  |-3
|  |  |-4
...

This gives page 4 the rootline 1/2/3/4. As SOLR cannot handle hierarchical data such as paths directly, we have to use a little trick: we cut the rootline into parts and create one snippet for each "depth" of the rootline. The snippets look like this: 0-1, 1-1/2, 2-1/2/3, 3-1/2/3/4. The first number encodes the depth of the snippet, followed by a "-" and the rootline up to this depth.
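
The slicing can be sketched in a few lines of plain PHP. This is only an illustration outside of TYPO3: the function name buildRootlineSnippets is made up for this example, and the rootline is assumed to be given as a list of page uids ordered from the root page down to the page itself.

```php
<?php
/**
 * Builds the depth-prefixed snippets for one rootline.
 * Hypothetical helper for illustration, not part of the solr extension.
 *
 * @param array $rootlineUids Page uids ordered from the root page to the target page
 * @return array One snippet per depth, e.g. "0-1", "1-1/2"
 */
function buildRootlineSnippets(array $rootlineUids) {
	$snippets = array();
	$path = array();
	// array_values() guarantees that $depth counts 0, 1, 2, ...
	foreach (array_values($rootlineUids) as $depth => $uid) {
		$path[] = $uid;
		$snippets[] = $depth . '-' . implode('/', $path);
	}
	return $snippets;
}

// Rootline of page 4 from the tree above
print_r(buildRootlineSnippets(array(1, 2, 3, 4)));
```

For the tree above this yields the four snippets 0-1, 1-1/2, 2-1/2/3 and 3-1/2/3/4.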

For versions of the TYPO3 solr extension prior to 2.0, the following schema changes are required:

<!--    
Multivalue field for rootline snippets like
0-1, 1-1/10, 2-1/10/100, 3-1/10/100/111

THIS IS ONLY REQUIRED FOR TX_SOLR VERSIONS < 2.0
-->
<field name="rootline" type="string" indexed="true" stored="true" multiValued="true" />
 

Rootline field in general_schema_fields.xml

Extending solr extension for hierarchical page tree

The next step is to add this field for pages to the TypoScript setup for solr. This can be done within the section plugin.tx_solr.index.fieldProcessingInstructions:

plugin.tx_solr {
	index {
		# assigns processing instructions to Solr fields 
		# during indexing, Solr field = processing instruction
		fieldProcessingInstructions {
			rootline = pageUidToHierarchy
		}

		queue {
			pages {
				fields {
					...
					# copy content of "pid" value into rootline field.
					# field modifier assigned above will convert pid to rootline.
					rootline = pid
				}
			}
		}
	}
}
TypoScript Configuration for rootline field processor

Now we need a field processor class that implements the interface "tx_solr_FieldProcessor". The field processor receives a plain page uid and transforms it into the rootline snippets, which are returned to be inserted into the SOLR index:

class tx_solr_fieldprocessor_PageUidToHierarchy implements tx_solr_FieldProcessor {

	/**
	 * Expects a PID of a page.
	 *
	 * Returns solr hierarchy notation of rootline of pid
	 *
	 * @param	array	Array of values, an array because of multivalued fields
	 * @return	array	Modified array of values
	 */
	public function process(array $values) {
		$results = array();

		foreach ($values as $value) {
			$solrHierarchyNotation = $this->getSolrRootlineForPid($value);
			$results[] = $solrHierarchyNotation;
		}
		return $results;
	}

	/**
	 * Returns the solr hierarchy notation snippets for the rootline of the given PID
	 *
	 * @param	int	$pid	Uid of the page whose rootline is resolved
	 * @return	array	Rootline snippets, one per depth
	 */
	protected function getSolrRootlineForPid($pid) {
		$obj_page = t3lib_div::makeInstance('t3lib_pageSelect');
		$rootline = $obj_page->getRootLine($pid);

		$rootlinePids = array();
		foreach($rootline as $depth => $pageInRootline) {
			array_unshift($rootlinePids, $pageInRootline['pid']);
		}

		$rootlinePids[] = $pid;

		return $this->getSolrHierarchyByRootlinePids($rootlinePids);
	}

	/**
	 * Returns the solr hierarchy notation snippets for an array of pids that make up a rootline
	 *
	 * @param	array	$rootlinePids	Page uids ordered from the root page to the target page
	 * @return	array	Rootline snippets, one per depth
	 */
	protected function getSolrHierarchyByRootlinePids($rootlinePids) {
		$hierarchies = array();
		$depth = 0;
		$currentHierarchy = array_shift($rootlinePids);
		foreach($rootlinePids as $rootlinePid) {
			$hierarchies[] = $depth . '-' . $currentHierarchy;
			$depth++;
			$currentHierarchy .= '/' . $rootlinePid;
		}
		$hierarchies[] = $depth . '-' . $currentHierarchy;
		return $hierarchies;
	}
}
PageUidToHierarchy Processor

The next step is a little ugly, as we have to touch the solr extension code itself. To make the rootline processor work, we have to add a few lines of code to the "tx_solr_fieldprocessor_Service":

switch ($instruction) {
	/* ... */
	case 'pageUidToHierarchy':
		$processor = t3lib_div::makeInstance('tx_solr_fieldprocessor_PageUidToHierarchy');
		$fieldValue = $processor->process($fieldValue);
		break;
	/* ... */
}
Additional case in tx_solr_fieldprocessor_Service

Don't forget to add your field processor class to solr's ext_autoload.php.

If everything went fine, you should get a new field in your index that looks similar to this one:

<arr name="rootline">
	<str>0-0</str>
	<str>1-0/1</str>
	<str>2-0/1/38</str>
	<str>3-0/1/38/39</str>
	<str>4-0/1/38/39/40</str>
</arr>
SOLR field content for rootline field

Facet query for page tree hierarchy

Once the rootline snippets are stored in the index, we can use different facet queries to get the information we want from the index.

The simplest query is one that counts all documents for each possible rootline:

select?facet=true&facet.mincount=0&facet.sort=count
&facet.field=rootline&q=*:*&rows=0
SOLR query for all rootlines with document counts (facet.mincount=0 also returns rootlines without documents)
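
If you want to build this request from PHP instead of typing it by hand, a minimal sketch could look like the following. The base URL is an assumption for this example and has to be adjusted to your own SOLR setup; the parameters are the ones from the query above.

```php
<?php
// Assumed location of the select handler; adjust host, port and core.
$solrSelectUrl = 'http://localhost:8983/solr/select';

$params = array(
	'q'              => '*:*',
	'rows'           => 0,
	'facet'          => 'true',
	'facet.field'    => 'rootline',
	'facet.mincount' => 0,
	'facet.sort'     => 'count',
);

// http_build_query() URL-encodes the values and joins the pairs with "&".
$url = $solrSelectUrl . '?' . http_build_query($params, '', '&');

echo $url . PHP_EOL;
```

Letting http_build_query() do the encoding avoids mistakes with characters such as "*" and ":" in the query string.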

You will get a response similar to this one:

<lst name="facet_counts">
	<lst name="facet_queries"/>
	<lst name="facet_fields">
		<lst name="rootline">
			<int name="1-1">32</int>
			<int name="2-1/7">6</int>
			<int name="2-1/27">5</int>
			<int name="2-1/38">3</int>
			<int name="3-1/27/32">3</int>
			<!-- ... -->
		</lst>
	</lst>
	<lst name="facet_dates"/>
	<lst name="facet_ranges"/>
</lst>
Facet results for rootline snippets

Perhaps you can see now why it was a good choice to store not only the complete rootline but every snippet. By doing so, we get the result counts for all subpages at once.
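
To work with these counts in PHP, the facet values can be split back into depth and path. The following is a rough sketch with a made-up function name; it assumes that the facet counts have already been read from the response into an array of facet value => document count.

```php
<?php
/**
 * Groups facet values such as "2-1/27" by their depth prefix.
 * Hypothetical helper for illustration only.
 *
 * @param array $facetCounts Facet value => document count
 * @return array Depth => (rootline path => document count)
 */
function groupFacetCountsByDepth(array $facetCounts) {
	$grouped = array();
	foreach ($facetCounts as $facetValue => $count) {
		$facetValue = (string)$facetValue;
		if (strpos($facetValue, '-') === false) {
			// Not a rootline snippet, skip it.
			continue;
		}
		// Split only at the first "-": depth on the left, path on the right.
		list($depth, $path) = explode('-', $facetValue, 2);
		$grouped[(int)$depth][$path] = $count;
	}
	ksort($grouped);
	return $grouped;
}

$counts = array('3-1/27/32' => 3, '1-1' => 32, '2-1/7' => 6, '2-1/27' => 5);
print_r(groupFacetCountsByDepth($counts));
```

With the counts grouped by depth, each level of a page tree menu can be rendered from a single facet response.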

Let's run a slightly more complicated query now and ask for all rootlines that have a depth of 3 and start with 1/. For this we use a prefix facet:

select?facet=true&facet.mincount=0&facet.sort=count
&facet.field=rootline&facet.prefix=3-1/&q=*:*&rows=0
SOLR query for all rootlines with prefix 3-1/
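
The value for facet.prefix can also be assembled from a depth and a parent path. A small sketch, with a function name that is made up for this example:

```php
<?php
/**
 * Builds the facet.prefix value for a depth and a parent rootline path.
 * Hypothetical helper for illustration only.
 *
 * @param int    $depth      Depth number the wanted snippets are stored with
 * @param string $parentPath Path the rootlines should start with, e.g. "1" or "1/27"
 * @return string Value for the facet.prefix parameter
 */
function buildRootlineFacetPrefix($depth, $parentPath) {
	// The trailing slash keeps a prefix like "1/2" from also matching "1/27".
	return (int)$depth . '-' . trim($parentPath, '/') . '/';
}

echo buildRootlineFacetPrefix(3, '1') . PHP_EOL;
echo 'facet.prefix=' . rawurlencode(buildRootlineFacetPrefix(3, '1/27')) . PHP_EOL;
```

The first call produces the prefix 3-1/ that is used in the query above.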

This will give us the following result:

<lst name="facet_counts">
	<lst name="facet_queries"/>
	<lst name="facet_fields">
		<lst name="rootline">
			<int name="3-1/27/32">3</int>
			<int name="3-1/38/39">2</int>
			<int name="3-1/14/31">1</int>
			<int name="3-1/19/20">1</int>
			<int name="3-1/27/29">1</int>
			<int name="3-1/7/10">1</int>
			<int name="3-1/7/11">1</int>
			<int name="3-1/7/12">1</int>
			<int name="3-1/7/13">1</int>
			<int name="3-1/7/8">1</int>
		</lst>
	</lst>
	<lst name="facet_dates"/>
	<lst name="facet_ranges"/>
</lst>
Facet result for rootlines with prefix 3-1/

Using rootline information for filtering search results

Last but not least, we should think about how to use the information gathered from the facets to restrict our search results. For this we use a simple filter query on the field rootline:

// all documents with rootline 3-1/27/32 
select?q=*:*&fq=rootline:3-1/27/32

// OR query for two different rootlines
select?q=*:*&fq=rootline:3-1/27/32%20OR%20rootline:2-1/41
Query for documents within rootline 3-1/27/32
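
When such a filter is built from user input, it is worth generating the fq value in code. The sketch below uses a made-up function name and puts every snippet in double quotes. This is a cautious choice and not taken from the queries above: depending on the SOLR version, characters such as "/" can have a special meaning in the query syntax, and quoting the value avoids having to escape them one by one.

```php
<?php
/**
 * Builds an fq value that matches any of the given rootline snippets.
 * Hypothetical helper for illustration only.
 *
 * @param array $snippets Rootline snippets such as "3-1/27/32"
 * @return string Filter query on the field "rootline"
 */
function buildRootlineFilterQuery(array $snippets) {
	$quoted = array();
	foreach ($snippets as $snippet) {
		// Inside a quoted value only backslash and double quote are escaped here.
		$quoted[] = '"' . addcslashes($snippet, '"\\') . '"';
	}
	return 'rootline:(' . implode(' OR ', $quoted) . ')';
}

$fq = buildRootlineFilterQuery(array('3-1/27/32', '2-1/41'));
echo $fq . PHP_EOL;

// URL-encoded request parameters for the select handler
echo 'select?' . http_build_query(array('q' => '*:*', 'fq' => $fq), '', '&') . PHP_EOL;
```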

Conclusion

This post should give you an impression of how to implement hierarchical facets with SOLR and the TYPO3 solr extension. How you render the facets and build a nice user interface for them is up to you. Once you start playing around with these features you will surely come up with many possible applications. Have fun!


 
Content © Michael Knoll 2009-2017