tokenizing twitter posts in lucene

solr lucene is a good technology to use for searching over a corpus of tweets. if you take the content of a tweet and dump it into the default solr lucene "text" field, you'll do pretty well. however, if you look at your results closely, you'll find one subtle but very annoying problem: a search on a hashtag term (say, #lucene) will also match the plain, non-hashtag term (lucene).
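
to make the problem concrete, here is a minimal python sketch. it does not use lucene at all: `naive_tokenize` is a hypothetical stand-in that assumes the default analysis chain drops the "#" along with other punctuation, which is the likely cause of the behavior described above. `hashtag_aware_tokenize` is likewise just an illustration of what keeping the "#" on the token would look like, not solr's actual api.

```python
import re

def naive_tokenize(text):
    # assumption: mimics an analyzer that lowercases and splits on
    # anything that is not a letter, digit, or underscore, so '#' is lost
    return [t for t in re.split(r"\W+", text.lower()) if t]

def hashtag_aware_tokenize(text):
    # hypothetical alternative: keep a leading '#' attached to the word
    return re.findall(r"#?\w+", text.lower())

tweet = "Indexing tweets with #lucene"

# with the naive tokenizer, the hashtag and the plain word become the same term
print(naive_tokenize(tweet))          # ['indexing', 'tweets', 'with', 'lucene']
print(naive_tokenize("#lucene") == naive_tokenize("lucene"))  # True

# keeping the '#' makes the two terms distinct
print(hashtag_aware_tokenize(tweet))  # ['indexing', 'tweets', 'with', '#lucene']
```

once both the index and the query go through the naive version, "#lucene" and "lucene" are indistinguishable, which is exactly the unwanted match.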

