User:MarkAHershberger/Weekly reports/2010-W10

Normalizing fullwidth characters

Following up on Philip's change for normalization of fullwidth Latin characters (Ａ/U+FF21 through ｚ/U+FF5A), I wrote and committed a script that takes care of fullwidth character normalization.
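
To make the mapping concrete, here is a minimal sketch of the normalization in Python (the committed script is PHP; this is illustrative, not the actual code). Each character in the fullwidth block sits a fixed offset of 0xFEE0 above its ASCII counterpart:

    # Fullwidth forms (U+FF01..U+FF5E) map to ASCII (U+0021..U+007E)
    # by subtracting a constant offset.
    FULLWIDTH_OFFSET = 0xFEE0

    def normalize_fullwidth(text):
        """Replace fullwidth Latin letters, digits, and punctuation
        with their ASCII equivalents; leave everything else alone."""
        return ''.join(
            chr(ord(c) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(c) <= 0xFF5E else c
            for c in text
        )

    print(normalize_fullwidth('Ｗｉｋｉ'))  # -> 'Wiki'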

Part of this work also involved refactoring the updateSearchIndex.php maintenance script so that this and future upgrade scripts can simply provide a “find” function and a “fix” callback function (see r63578).
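
The shape of that refactoring can be sketched roughly as follows (the real driver is PHP in updateSearchIndex.php; the Python and the names here are invented for illustration). The shared driver owns batching and looping, and each upgrade script supplies only the two callbacks:

    # Hypothetical sketch of the find/fix driver pattern.
    def run_index_update(find, fix, batch_size=100):
        """`find` returns up to batch_size affected rows; `fix` repairs
        one row. Loop until `find` reports nothing left to do."""
        while True:
            rows = find(batch_size)
            if not rows:
                break
            for row in rows:
                fix(row)

A fullwidth-normalization upgrade, for example, would plug in a find that selects rows containing fullwidth characters and a fix that rewrites them with the normalization above.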

While doing this work, I discovered that only MySQL was normalizing the fullwidth characters and, further, that normalization was only happening for the Chinese and Japanese languages. While those languages may be more likely to include fullwidth characters, examples of fullwidth characters can be found on the English Wikipedia as well, so there is good reason to normalize for all languages.

After discussing this with TimStarling and agreeing with him that the normalization had to be fast, I've begun adding normalization to all the search backends for all languages. This coming week I will need to profile it and ensure that the common case (no normalization needed) doesn't add too much overhead.
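
One plausible way to keep the common case cheap (my assumption, not necessarily what the committed code does) is to scan for a fullwidth character first and return the input untouched when none is found:

    def maybe_normalize(text):
        # Cheap scan first: most text contains no fullwidth characters,
        # so the common case avoids building a new string entirely.
        if not any(0xFF01 <= ord(c) <= 0xFF5E for c in text):
            return text
        return normalize_fullwidth(text)  # from the earlier sketch

Profiling can then compare this against a no-op on representative wiki text to quantify the overhead.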

Escaping MySQL UTF-8 encoding

I found that, while fullwidth UTF-8 characters and the dots in domain names are encoded as “u8” sequences for MySQL's database search, these sequences are not escaped in any way. This means that a search for “example.com” will also match (unlikely, but possible) literal text such as “exampleu82ecomu800”. Since only alphanumerics are allowed in the searchindex tables at present, the u8 sequences can be delimited with angle brackets, which cannot otherwise appear there. I began testing this change last week and should have it committed ASAP.
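
The collision and the proposed fix can be illustrated with a simplified sketch (Python for illustration; the real encoder is in MediaWiki's PHP search code and handles multi-byte UTF-8, while this version only handles single-byte characters):

    def encode_plain(term):
        # Current scheme: a non-alphanumeric becomes a bare 'u8' + hex
        # sequence, which is itself alphanumeric and so can collide
        # with literal article text.
        return ''.join(c if c.isalnum() else 'u8%02x' % ord(c)
                       for c in term)

    def encode_bracketed(term):
        # Proposed scheme: delimit the sequence with angle brackets,
        # which can never appear among the alphanumeric-only terms in
        # the searchindex tables.
        return ''.join(c if c.isalnum() else '<u8%02x>' % ord(c)
                       for c in term)

    print(encode_plain('example.com'))      # exampleu82ecom
    print(encode_bracketed('example.com'))  # example<u82e>com

With the bracketed form, a search for “example.com” can no longer match literal text such as “exampleu82ecomu800”.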

Other Search Oddities

During all this investigation, I turned up one more bit of strangeness: lucene-search doesn't seem to normalize numbers, but Wikipedia searches do. I'm not sure what the difference is at present. It could just be that my installation is screwy, but I'll track it down this week.