Custom RBBI rules for ICU tokenizer that don't override all defaults
06:34 01 Feb 2026

I am trying to configure the ICU tokenizer so that it does not split text into tokens on certain characters, such as hyphens.

I got the custom rules loaded and working in SolrCloud, but it appears that as soon as I added

"tokenizer": {
    "name": "icu",
    "rulefiles": "Latn:hyphen-preserving.rbbi"
}

only the rules inside the RBBI file apply:

!!chain;

$ALetter = [:L:];
$Numeric = [:N:];
$MidHyphen = [-];

$ALetter ($MidHyphen $ALetter)+ {200};
$Numeric ($MidHyphen $Numeric)+ {200};

and all the other (default) rules are thrown away, resulting in wrong (severely limited) tokenization.
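To make "severely limited" concrete, here is a rough Python sketch of what those two rules can match when taken completely on their own. It is only a regex analogy, not how ICU actually evaluates RBBI rules: it ignores chaining and rule status, and the names `char_class` and `matched_spans` are made up for this illustration. It assumes, as the rules are written, that `$ALetter` and `$Numeric` each stand for exactly one character.

```python
import re
import unicodedata


def char_class(ch):
    """Map a character to a one-letter class mirroring the rule variables."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):   # [:L:] -> $ALetter
        return "L"
    if cat.startswith("N"):   # [:N:] -> $Numeric
        return "N"
    if ch == "-":             # [-]   -> $MidHyphen
        return "-"
    return "?"                # anything no variable covers


# Analogy for:  $ALetter ($MidHyphen $ALetter)+  and  $Numeric ($MidHyphen $Numeric)+
PATTERN = re.compile(r"L(?:-L)+|N(?:-N)+")


def matched_spans(text):
    """Return the substrings that the two rules, taken alone, would cover."""
    classes = "".join(char_class(ch) for ch in text)
    return [text[m.start():m.end()] for m in PATTERN.finditer(classes)]


print(matched_spans("a-b-c"))        # ['a-b-c']
print(matched_spans("well-known"))   # ['l-k']
print(matched_spans("hello world"))  # []
```

In this approximation, plain words are not covered by any rule at all, and a hyphenated word is only covered around the hyphen itself, which is consistent with the tokenization breaking down once the defaults are gone.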

Is there a default RBBI file available I can adapt?

Or is there a way to override only specific RBBI rules while keeping the rest of the defaults?

solr solrcloud icu