Creating Search Schemas
Note on Search 2.0 vs. Legacy Search
This document refers to the new Riak Search 2.0 with Solr integration (codenamed Yokozuna).
Riak Search is built for ease of use, allowing you to write values into Riak and query for values using Solr. Riak Search does a lot of work under the hood to convert your values—plain text, JSON, XML, Riak Data Types, and more—into something that can be indexed and searched later. Nonetheless, you must still instruct Riak/Solr how to index a value. Are you providing an array of strings? An integer? A date? Is your text in English or Russian? You can provide such instructions to Riak Search by defining a Solr schema.
The Default Schema
Riak Search comes bundled with a default schema named _yz_default. The default schema covers a wide range of possible field types. You can find the default schema on GitHub.
While using the default schema provides an easy path to starting
development, we recommend that you define your own schema in production.
Take note of dynamicField name="*", which is a catch-all index for any value. Sufficiently large objects can take up tremendous amounts of disk space when indexed this way, so pay special attention to those indexes.
Custom Schemas
We’ll show you how you can create custom schemas by way of example.
Let’s say that you have already created a schema named cartoons in a file named cartoons.xml. This would register the custom schema in Riak Search:
import java.io.File;
import org.apache.commons.io.FileUtils;
import com.basho.riak.client.api.commands.search.StoreSchema;
import com.basho.riak.client.core.query.search.YokozunaSchema;

// "client" is an already-connected RiakClient
File xml = new File("cartoons.xml");
String xmlString = FileUtils.readFileToString(xml);
YokozunaSchema schema = new YokozunaSchema("cartoons", xmlString);
StoreSchema storeSchemaOp = new StoreSchema.Builder(schema).build();
client.execute(storeSchemaOp);
schema_data = File.read("cartoons.xml")
client.create_search_schema("cartoons", schema_data)
(new \Basho\Riak\Command\Builder\Search\StoreSchema($riak))
  ->withName('cartoons')
  ->withSchemaFile('cartoons.xml')
  ->build()
  ->execute();
with open('cartoons.xml', 'r') as xml_file:
    schema_data = xml_file.read()

client.create_search_schema('cartoons', schema_data)
var xml = File.ReadAllText("cartoons.xml");
var schema = new SearchSchema("cartoons", xml);
var rslt = client.PutSearchSchema(schema);
var fs = require('fs');

fs.readFile('cartoons.xml', function (err, data) {
    if (err) {
        throw new Error(err);
    }

    var schemaXml = data.toString('utf8');

    var options = {
        schemaName: 'cartoons',
        schema: schemaXml
    };

    client.storeSchema(options, function (err, rslt) {
        if (err) {
            throw new Error(err);
        }
    });
});
{ok, SchemaData} = file:read_file("cartoons.xml"),
riakc_pb_socket:create_search_schema(Pid, <<"cartoons">>, SchemaData).
curl -XPUT http://localhost:8098/search/schema/cartoons \
-H 'Content-Type:application/xml' \
--data-binary @cartoons.xml
Creating a Custom Schema
The first step in creating a custom schema is to define exactly what fields you must index. Part of that step is understanding how Riak Search extractors function.
Extractors
In Riak Search, extractors are modules responsible for pulling out a list of fields and values from a Riak object. How this is achieved depends on the object’s content type, but the two common cases are JSON and XML, which operate similarly. Our examples here will use JSON.
The following JSON object represents the character Lion-o from the cartoon Thundercats. He has a name and age, he’s the team leader, and he has a list of aliases in other languages.
{
  "name": "Lion-o",
  "age": 30,
  "leader": true,
  "aliases": [
    {"name": "León-O", "desc_es": "Señor de los ThunderCats"},
    {"name": "Starlion", "desc_fr": "Le jeune seigneur des Cosmocats"}
  ]
}
The extractor will flatten the above objects into a list of field/value
pairs. Nested objects will be separated with a dot (.) and arrays will
simply repeat the fields. The above object will be extracted to the
following list of Solr document fields.
name=Lion-o
age=30
leader=true
aliases.name=León-O
aliases.desc_es=Señor de los ThunderCats
aliases.name=Starlion
aliases.desc_fr=Le jeune seigneur des Cosmocats
This means that our schema should handle name, age, leader, aliases.name (a dot is a valid field character), and aliases.desc_*, which is a description in the given language of the suffix (Spanish and French).
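In schema terms, those extracted field names translate into declarations along these lines (a condensed preview of the full example built up in the Defining Fields section below):

<field name="aliases.name" type="string" indexed="true" stored="true" multiValued="true" />
<dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true" />
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true" />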
Required Schema Fields
Solr schemas can be very complex, containing many types and analyzers.
Refer to the Solr 4.7 reference guide for a complete list. You should be aware, however, that there are a few fields that are required by Riak Search in order to properly distribute an object across a cluster. These fields are all prefixed with _yz, which stands for Yokozuna, the original code name for Riak Search.
Below is a bare minimum skeleton Solr Schema. It won’t do much for you other than allow Riak Search to properly manage your stored objects.
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="schedule" version="1.5">
  <fields>
    <!-- All of these fields are required by Riak Search -->
    <field name="_yz_id" type="_yz_str" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="_yz_ed" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_pn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_fpn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_vtag" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_rk" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rt" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rb" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_err" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
  </fields>

  <uniqueKey>_yz_id</uniqueKey>

  <types>
    <!-- YZ String: Used for non-analyzed fields -->
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true" />
  </types>
</schema>
If you’re missing any of the above fields, Riak Search will reject your
custom schema. The value for <uniqueKey> must be _yz_id.
In the table below, you’ll find a description of the various required
fields. You’ll rarely need to use any fields other than _yz_rt (bucket type), _yz_rb (bucket), and _yz_rk (Riak key). On occasion, _yz_err can be helpful if you suspect that your extractors are failing. Malformed JSON or XML will cause Riak Search to index a key and set _yz_err to 1, allowing you to reindex with proper values later.
Field | Name | Description |
---|---|---|
_yz_id | ID | Unique identifier of this Solr document |
_yz_ed | Entropy Data | Data related to active anti-entropy |
_yz_pn | Partition Number | Used as a filter query parameter to remove duplicate replicas across nodes |
_yz_fpn | First Partition Number | The first partition in this doc’s preflist, used for further filtering on overlapping partitions |
_yz_vtag | VTag | If there is a sibling, use vtag to differentiate them |
_yz_rk | Riak Key | The key of the Riak object this doc corresponds to |
_yz_rt | Riak Bucket Type | The bucket type of the Riak object this doc corresponds to |
_yz_rb | Riak Bucket | The bucket of the Riak object this doc corresponds to |
_yz_err | Error Flag | Indicates whether this doc is the product of a failed object extraction |
Defining Fields
With your required fields known and the skeleton schema elements in place, it’s time to add your own fields. Since you know your object structure, you need to map the name and type of each field (a string, integer, boolean, etc.).
When creating fields, you can either create specific fields via the field element or an asterisk (*) wildcard field via dynamicField. A field that matches a specific field name wins; otherwise, Riak Search attempts to match it against a dynamic field pattern.
Besides a field type, you also must decide if a value is to be indexed (usually true) and stored. When a value is stored, you can get the value back as a result of a query, but this also doubles the storage of the field (once in Riak, again in Solr). If a single Riak object can have more than one copy of the same matching field, you also must set multiValued to true.
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="schedule" version="1.5">
  <fields>
    <field name="name" type="string" indexed="true" stored="true" />
    <field name="age" type="int" indexed="true" stored="false" />
    <field name="leader" type="boolean" indexed="true" stored="false" />
    <field name="aliases.name" type="string" indexed="true" stored="true" multiValued="true" />
    <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true" />
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true" />

    <!-- All of these fields are required by Riak Search -->
    <field name="_yz_id" type="_yz_str" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="_yz_ed" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_pn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_fpn" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_vtag" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
    <field name="_yz_rk" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rt" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_rb" type="_yz_str" indexed="true" stored="true" multiValued="false"/>
    <field name="_yz_err" type="_yz_str" indexed="true" stored="false" multiValued="false"/>
  </fields>

  <uniqueKey>_yz_id</uniqueKey>
Next, take note of the types you used in the fields and ensure that each field type is defined as a fieldType under the types element. Basic types such as string, boolean, and int have matching Solr classes. There are dozens more types, including many kinds of numbers (float, tdouble, random), date fields, and even geolocation types.
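For example, numeric and date types are typically declared with the Trie field classes; the declarations below follow the stock Solr 4.x example schema (include only the types your fields actually reference):

<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>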
Besides simple field types, you can also customize analyzers for different languages. In our example, we mapped any field that ends with *_es to Spanish, and *_fr to French.
  <types>
    <!-- YZ String: Used for non-analyzed fields -->
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>

    <!-- Spanish -->
    <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
        <filter class="solr.SpanishLightStemFilterFactory"/>
        <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/> -->
      </analyzer>
    </fieldType>
    <!-- French -->
    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" />
        <filter class="solr.FrenchLightStemFilterFactory"/>
        <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
        <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
      </analyzer>
    </fieldType>
  </types>
</schema>
“Catch-All” Field
Without a catch-all field, an exception will be thrown if data is provided to index without a corresponding <field> element. The following is the catch-all field from the default Yokozuna schema and can be used in a custom schema as well.
<dynamicField name="*" type="ignored" />
The following is required to be a child of the types element in the schema:
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
Dates
The format of strings that represent a date/time is important, as Solr only understands ISO 8601 UTC date/time values. An example of a correctly formatted date/time string is 1995-12-31T23:59:59Z. If you provide an incorrectly formatted date/time value, an exception similar to this will be logged to solr.log:
2014-02-27 21:30:00,372 [ERROR] <qtp1481681868-421>@SolrException.java:108 org.apache.solr.common.SolrException: Invalid Date String:'Thu Feb 27 21:29:59 +0000 2014'
at org.apache.solr.schema.DateField.parseMath(DateField.java:182)
at org.apache.solr.schema.TrieField.createField(TrieField.java:611)
at org.apache.solr.schema.TrieField.createFields(TrieField.java:650)
at org.apache.solr.schema.TrieDateField.createFields(TrieDateField.java:157)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:47)
...
...
...
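To index such timestamps, declare a field with a date-capable type in your schema. A minimal sketch (the field name created is hypothetical, and the date type must be defined under types, for example as a solr.TrieDateField as sketched earlier):

<field name="created" type="date" indexed="true" stored="true" />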
Field Properties By Use Case
Sometimes it can be tricky to decide whether a value should be stored, or whether multiValued is allowed. This handy table from the Solr documentation may help you pick field properties.

An entry of true or false in the table indicates that the option must be set to the given value for the use case to function correctly. If no entry is provided, the setting of that attribute has no impact on the case.
Use Case | indexed | stored | multiValued | omitNorms | termVectors | termPositions |
---|---|---|---|---|---|---|
search within field | true | | | | | |
retrieve contents | | true | | | | |
use as unique key | true | | false | | | |
sort on field | true | | false | true [1](#notes) | | |
use field boosts[5](#notes) | | | | false | | |
document boosts affect searches within field | | | | false | | |
highlighting | true [4](#notes) | true | | | [2](#notes) | true [3](#notes) |
faceting[5](#notes) | true | | | | | |
add multiple values, maintaining order | | | true | | | |
field length affects doc score | | | | false | | |
MoreLikeThis[5](#notes) | | | | | true [6](#notes) | |