I am trying to modify the ICU tokenizer so that it does not split text into tokens on certain characters, such as hyphens.
I got the custom rules loaded and working in SolrCloud, but it appears that as soon as I added:
"tokenizer": {
    "name": "icu",
    "rulefiles": "Latn:hyphen-preserving.rbbi"
}
only the rules inside the RBBI file apply:
!!chain;
$ALetter = [:L:];
$Numeric = [:N:];
$MidHyphen = [-];
$ALetter ($MidHyphen $ALetter)+ {200};
$Numeric ($MidHyphen $Numeric)+ {200};
and all other rules are thrown away, resulting in wrong (severely limited) tokenization.
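To make the goal concrete, here is a rough sketch of the output I am after, written in plain Python. It is only a regex approximation of the desired result, not ICU and not the UAX #29 word-break rules; the names `TOKEN` and `tokenize` are hypothetical and exist purely for illustration. Ordinary words should still come out as tokens, and a hyphen between two letters or between two digits should no longer cause a split.

```python
import re

# Hypothetical approximation of the desired tokenization (NOT ICU itself).
# A token is a run of letters or a run of digits, optionally continued by
# a single hyphen followed by more characters of the same kind.
TOKEN = re.compile(
    r"[^\W\d_]+(?:-[^\W\d_]+)*"  # letter runs, e.g. "well-known"
    r"|\d+(?:-\d+)*"             # digit runs, e.g. "2019-2020"
)


def tokenize(text):
    """Return the tokens the analyzer should ideally emit for `text`."""
    return TOKEN.findall(text)


print(tokenize("A well-known fix from 2019-2020, e-mail me"))
# ['A', 'well-known', 'fix', 'from', '2019-2020', 'e-mail', 'me']
```

A hyphen that is not surrounded by characters of the same kind (as in "a - b" or "abc-123") still separates tokens in this sketch, which mirrors the two custom rules above.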
Is there a default RBBI file available I can adapt?
Or is there a way to override only specific RBBI rules?