tokenizing twitter posts in lucene

solr lucene is a good technology for searching over a corpus of tweets. if you take the content of a tweet and dump it into the default solr lucene "text" field, you'll do pretty well. however, if you look at your results closely, you'll find one subtle but very annoying problem: searches on a hashtag term will also match the non-hashtag term.

for example, a search on the term #cassandra will match tweets containing the term cassandra. this is annoying, as there is a semantic difference between the two. tweets containing #cassandra are pretty likely to be related to the cassandra nosql database, whereas tweets containing just cassandra cover a wide variety of topics.

the behavior we desire is summarized in the table, below.

Tweet Content    Query      Result
#Oracle          Oracle     Matched
Oracle           Oracle     Matched
Oracle           #Oracle    Not Matched

i'm sure there are many ways to obtain this result, but after much experimentation, here's what i came up with for the schema definition of my text field type.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[\W&amp;&amp;[^@#]]"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="com.johnandcailin.solr.analysis.TwitterWordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
</fieldType>

for the index analyzer, we start by tokenizing using the PatternTokenizerFactory. the regular expression provided tokenizes on all non-word characters except @ and #. then we apply the standard stop words removal. and then comes the fancy part.
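to see what that regular expression does on its own, here is a minimal java sketch that splits a tweet with the same pattern, outside of solr. the class name TweetTokenizerSketch and the sample tweet are made up for illustration, and the sketch only approximates the tokenizer by splitting and dropping empty pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// hypothetical helper, not part of solr: approximates the tokenizer's split step
public class TweetTokenizerSketch {
    // same expression as the schema's pattern attribute, with the XML escaping (&amp;&amp;) undone:
    // any non-word character, except @ and #
    private static final Pattern DELIMITER = Pattern.compile("[\\W&&[^@#]]");

    static List<String> tokenize(String tweet) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITER.split(tweet)) {
            // adjacent delimiters (", ") produce empty pieces; skip them
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [loving, #cassandra, thanks, @john]
        System.out.println(tokenize("loving #cassandra, thanks @john!"));
    }
}
```

the point to notice is that #cassandra and @john survive as single tokens, so the later filters still see the leading # and @.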

when indexing the term #cassandra we wish to store both the term #cassandra and the term cassandra. the easiest way i found to do this was to write two custom classes: a filter factory (the TwitterWordDelimiterFilterFactory referenced in the schema above) and the filter it creates. these two classes are exact replicas of WordDelimiterFilterFactory and WordDelimiterFilter, with a two line change that instructs the filter to treat the @ and # characters as if they were digits.

note that the index analyzer and the query analyzer are not symmetric. we do not want to split the search term #cassandra into #cassandra and cassandra, as that would send us right back to our original problem, in which #cassandra matches tweets containing only cassandra.
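the asymmetry is easier to see in a toy model. the java sketch below is not the real filter chain: it ignores stop words, synonyms and stemming, and it simply hard-codes the outcome described above (index side stores both the tagged and the bare term, query side never splits). the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// hypothetical toy model of the index/query asymmetry; not solr code
public class AsymmetricAnalysisSketch {
    // index side: keep the tagged term and also store the bare word
    static Set<String> indexTerms(String tweet) {
        Set<String> terms = new HashSet<>();
        for (String token : tweet.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(token);
            char first = token.charAt(0);
            if (token.length() > 1 && (first == '#' || first == '@')) {
                terms.add(token.substring(1));
            }
        }
        return terms;
    }

    // query side: lowercase only, never split off the # or @
    static String queryTerm(String query) {
        return query.trim().toLowerCase(Locale.ROOT);
    }

    static boolean matches(String tweet, String query) {
        return indexTerms(tweet).contains(queryTerm(query));
    }

    public static void main(String[] args) {
        System.out.println(matches("#Oracle", "Oracle"));  // true
        System.out.println(matches("Oracle", "Oracle"));   // true
        System.out.println(matches("Oracle", "#Oracle"));  // false
    }
}
```

the three calls in main correspond to the three rows of the table of desired behavior. if the query side also split #oracle into oracle, the last call would return true, which is exactly the problem we started with.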

need help getting this set-up in your solr lucene environment? drop me a line and i'll pass along some more tips.

Very useful, thanks!

An alternative I have used often (when you control the creation of Solr documents at indexing time) is to specifically extract #terms and @terms from the text and add them to a new field which is not tokenised (you can either separate tags with whitespace or use a multivalued field).

This has the bonus of giving you facets on the field (which can be quite useful), but has the disadvantage that you need to write more complicated OR queries such as:

cassandra OR tags:#cassandra
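The extraction step could look roughly like the Java sketch below. It is only an illustration: the class name, the regular expression, and the idea of feeding the result into a field called tags (as in the query above) are assumptions, and how the values are attached to the Solr document is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls #terms and @terms out of a tweet before indexing
public class TagExtractorSketch {
    // A # or @ that does not follow a word character (so "a@b.com" is skipped),
    // followed by one or more word characters
    private static final Pattern TAG = Pattern.compile("(?<!\\w)[#@]\\w+");

    static List<String> extractTags(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(tweet);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [#cassandra, @john]
        System.out.println(extractTags("trying out #cassandra today, thanks @john"));
    }
}
```

Each extracted value would then go into the untokenised field, either as one value of a multivalued field or joined with whitespace.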

Wow.. awesome post, very useful for me.. thanks a lot
