Diacritics & search
Posted: Tue Oct 23, 2007 10:47 am
I'd like to gauge how other users feel about the handling of diacritics when searching references.
What I call here diacritic-sensitive search is one in which diacritics are treated as distinct characters. For instance, a user typing "Lopez" gets only "Lopez"; to retrieve "López," the user has to type "López". In contrast, I define diacritic-blind search as typing "Lopez" and getting both "Lopez" and "López". The point I advance here is that searches in the list view and the advanced search window (activated with command-f in BE) should disregard diacritics when matching. I don't think the logic is any different from that of case-sensitive versus case-insensitive searches.
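To make the distinction concrete, here's a rough Python sketch of the two behaviors. It isn't how BE or any other program actually does it; the names `strip_diacritics`, `search` and the sample `authors` list are made up for illustration. The approach assumed here is to decompose each string with Unicode NFD normalization and drop the combining marks before comparing. Note that this also folds "ñ" into "n", which may or may not be what one wants.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks (accents, tildes, etc.) removed."""
    # NFD splits "ó" into a plain "o" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so those are dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query, records, diacritic_sensitive=False):
    """Return the records that contain the query as a substring."""
    if diacritic_sensitive:
        # Diacritics count as distinct characters: "Lopez" != "López".
        return [r for r in records if query in r]
    # Diacritic-blind: compare the stripped forms of query and record.
    needle = strip_diacritics(query)
    return [r for r in records if needle in strip_diacritics(r)]


# Hypothetical sample data.
authors = ["Lopez, A.", "López, B.", "Perez, C."]

sensitive_hits = search("Lopez", authors, diacritic_sensitive=True)
blind_hits = search("Lopez", authors)
```

With this sample list, `sensitive_hits` holds only "Lopez, A.", while `blind_hits` holds both "Lopez, A." and "López, B."; typing "López" in blind mode returns the same two records.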
At least from my own experience with Romance languages, and particularly Spanish, diacritic-sensitive searches are not a worthy feature but a nuisance. They make it difficult to retrieve authors whose names are spelled inconsistently, to find word variations that are sometimes accented and sometimes not, and to search for authors or words in languages one is not fluent in.
At least in the Romance languages I know, diacritics don't make a vowel a different letter; they just change the pronunciation. That's why diacritics don't affect the sorting in a list. Also, diacritics change according to the inflection of the word. So, for instance, "población" has an accent, but "poblaciones" doesn't. So even in a language the user knows well, the user has to repeat the search several times to retrieve the desired results.
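As a small illustration of the sorting point, here's a Python sketch that sorts on the unaccented form of each word. The function name `sort_key` and the word list are my own, chosen just for the example, and real programs may well use proper locale-aware collation instead of this shortcut.

```python
import unicodedata


def sort_key(word: str):
    """Sort on the unaccented, case-folded form; the original breaks ties."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)


# Hypothetical word list.
words = ["poblado", "poblaciones", "población", "poblar"]

# Plain sorting compares raw code points, so "ó" sorts after "o".
by_code_point = sorted(words)

# Sorting on the folded key treats "ó" like "o".
by_folded = sorted(words, key=sort_key)
```

Here `by_code_point` puts "poblaciones" ahead of "población", while `by_folded` lists "población" first, as a dictionary would.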
Moreover, for proper names (geographic places & authors), diacritic sensitivity is particularly tricky. Typically "López" is spelled "Lopez" in the US. Mexico is México south of the border, not to mention Peru-Perú, Potosí-Potosi, and the list goes on. Even the same author publishing both in the US and in a Spanish-speaking country may be spelled differently. Compounding the problem, libraries, electronic resources and bibliographic collections may not record the diacritic (though they've been doing so more consistently lately).
Diacritics get even more in the way when users search in a language that is not their own, or one whose rules have changed during the period covered by the bibliography. Rules for the use of diacritics have changed significantly over the last century or two; in Portuguese, for instance, accent rules changed in Brazil in the twentieth century, and the rules are now different on each side of the Atlantic. I believe reforms were proposed (I'm unsure if passed) in German and French. In Spanish there's no consistency in accents until well into the 19th century.
I believe these reasons are cogent enough that major library search portals are diacritic-insensitive. I don't think any academic or non-academic search service I've used recently is diacritic-sensitive.
Finally, other database programs are diacritic-blind, for instance Filemaker (which I use for my primary sources, in part because of the inconsistency of orthographic rules over a long period of time) and Bibdesk.