Diacritics & search

Post by **aechallu** » Tue Oct 23, 2007 10:47 am

I wanted to gauge how other users feel about the handling of diacritics when searching references.

By diacritic-sensitive search I mean that diacritics are treated as distinct characters. For instance, a user typing "Lopez" gets "Lopez"; to retrieve "López," the user has to type "López". In contrast, I define diacritic-blind search as typing "Lopez" and getting both "Lopez" and "López". My point is that searches in the list view and the advanced search window (activated with command-f in BE) should disregard diacritics. I don't think the logic is any different from that of case-sensitive versus case-insensitive searches.
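A minimal sketch of the two behaviours, assuming diacritics are ignored by decomposing each string (Unicode NFD) and dropping the combining marks. The names `fold_diacritics` and `search` are made up for illustration and say nothing about how Bookends works internally.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed, e.g. 'López' -> 'Lopez'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(names, query, diacritic_sensitive=False):
    """Return the names that contain query (hypothetical helper)."""
    if diacritic_sensitive:
        # Diacritics count as distinct characters: plain substring match.
        return [name for name in names if query in name]
    # Diacritic-blind: compare the folded forms of both sides.
    folded_query = fold_diacritics(query)
    return [name for name in names if folded_query in fold_diacritics(name)]


authors = ["Lopez", "L\u00f3pez", "Martin"]  # "\u00f3" is "ó"
sensitive_hits = search(authors, "Lopez", diacritic_sensitive=True)
blind_hits = search(authors, "Lopez")
```

Here `sensitive_hits` contains only the unaccented spelling, while `blind_hits` contains both. Letters that are not a base letter plus an accent (for example "ø" or "ß") are not touched by this folding and would need their own mapping.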

At least in my own experience with Romance languages, and Spanish in particular, diacritic-sensitive searches are not a worthy feature but a nuisance. They make it difficult to retrieve authors spelled inconsistently, to find word variations that are accented and unaccented, and to search for authors or words in languages one is not fluent in.

At least in the Romance languages I know, diacritics don't make a vowel a different letter; they only change the pronunciation. That's why diacritics don't affect the sorting in a list. Also, diacritics change with the inflection of a word. For instance, "población" has an accent, but "poblaciones" doesn't. So even in a language the user knows well, the user has to repeat the search several times to retrieve the desired results.

Moreover, for proper names (geographic places and authors), diacritic sensitivity is particularly tricky. Typically "López" in the US is spelled "Lopez". Mexico is México south of the border; not to mention Peru-Perú, Potosí-Potosi, and the list goes on. Even the same author publishing in the US and in a Spanish-speaking country may be spelled differently. Compounding the problem, libraries, electronic resources and bibliographic collections may not record the diacritic (though they've been doing so more consistently lately).

Diacritics get even more in the way when users search in a language that is not their own, or one whose rules have changed during the period covered by the bibliography. Rules for the use of diacritics have changed significantly over the last century or more; in Portuguese, for instance, accent rules changed in Brazil in the twentieth century, and the rules are now different on each side of the Atlantic. I believe reforms were proposed (I'm unsure if passed) in German and French. In Spanish, accents were not used consistently until well into the 19th century.

I believe the reasons are cogent enough that major library search portals are diacritic-insensitive. I don't think any academic or non-academic search service I've used recently is diacritic sensitive.

Finally, other database programs are diacritic-blind, for instance FileMaker (which I use for my primary sources, in part because of the inconsistency of orthographic rules over a long period of time) and BibDesk.

Post by **Reiner** » Tue Oct 23, 2007 11:01 am

I voted "Yes" because in most cases I prefer the diacritic-sensitive way of searching (which makes a lot of sense in German). But I can imagine there will also be rare circumstances in which I would like to search diacritic-insensitively.

Therefore the best option would be not to change the behaviour from sensitive to insensitive, but to introduce a checkbox in the search field as well as in the advanced search to choose how the current search should behave. And of course it would be nice if the last setting were saved for your next search.

Post by **aechallu** » Tue Oct 23, 2007 3:38 pm

Making it optional sounds perfect to me. To avoid crowding the list view toolbar I would create a checkbox under Preferences/Lists (default set to no diacritics, as it is the way it works now) because users won't change this setting very often. Then the Search 'Database' window can have a checkbox as well.

Post by **Jon** » Tue Oct 23, 2007 5:16 pm

Note that making it optional is not practical. The indexes would have to be collated one way or the other, and once created could not be changed.

Jon
Sonny Software

Post by **hareiko** » Wed Oct 24, 2007 5:14 am

Dear Jon,

This is indeed a nuisance, not only in Spanish but also in German (umlauts: äöü) and some other languages I know. When I search PubMed or any English database, for example, the German name "Förster" is written "Forster". I will change the result to "Förster" in my database in order to quote the name correctly, but then I will not see the duplicates at once the next time I import from PubMed.

Also, when I search my database for a certain author with German Umlaut or French accent, I have to repeat my search and enter the different possible name variations into the search field, because I do not remember whether this name was entered with or without diacritic or even co-exists in both spellings.
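To illustrate the duplicate problem described above, here is a rough sketch that groups author names by a diacritic-folded key, so that "Förster" and "Forster" land in the same bucket. The function names and the sample list are invented for this example; this is not a Bookends feature.

```python
import unicodedata
from collections import defaultdict


def fold_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_variants(names):
    """Map each folded key to its spellings, keeping only keys with 2+ spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[fold_diacritics(name).casefold()].add(name)
    return {key: sorted(spellings)
            for key, spellings in groups.items() if len(spellings) > 1}


# Hypothetical author list: "\u00f6" is "ö", "\u00fc" is "ü".
library = ["F\u00f6rster", "Forster", "M\u00fcller", "Schmidt"]
variants = group_variants(library)
```

With this sample, `variants` has a single key, `"forster"`, holding both spellings. Transliterated forms such as "Foerster" would not be caught by this folding and would need a separate rule.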

You said an option in the search dialog is not feasible because of the indexes used. I see two possible solutions:
1. Create two indexes, one diacritic-sensitive and one diacritic-insensitive, and use whichever is selected in the option field (preferred), or
2. Put the selection for diacritic-sensitive or diacritic-insensitive into the preferences and build the index accordingly.

Best regards
Hans-Reinhard

Post by **Jon** » Wed Oct 24, 2007 7:40 am

Hi,

Having two indexes for everything is not practical. It would almost double the size of each db (and they're large enough as it is). I don't think a preference option is a good idea (it would only apply to new databases). This doesn't seem to be a particularly contentious issue, judging by the feedback so far, so I'll look into its feasibility for the next update.

Jon
Sonny Software

Post by **Jon** » Wed Oct 24, 2007 9:12 am

For those following this, I've enabled diacritic-insensitive searches in the beta of the next Bookends update. This will apply to all new databases. To enable it in a database created with a previous version of Bookends, you simply need to rebuild it. A diacritic-sensitive search is still available if you do a Find and search by "characters".
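One way a single index could serve both kinds of search, sketched here purely as an assumption (the thread does not describe how Bookends implements this): key the index on the folded form, and for a sensitive search filter the candidates by their stored spelling. The class `FoldedIndex` and its methods are hypothetical.

```python
import unicodedata
from collections import defaultdict


def fold_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class FoldedIndex:
    """Toy word index keyed on the diacritic-folded, case-folded word."""

    def __init__(self):
        self._entries = defaultdict(list)  # folded key -> [(stored word, ref id)]

    def add(self, word, ref_id):
        self._entries[fold_diacritics(word).casefold()].append((word, ref_id))

    def find(self, word, exact=False):
        candidates = self._entries.get(fold_diacritics(word).casefold(), [])
        if not exact:
            # Diacritic-blind: every spelling under the folded key matches.
            return [ref_id for _, ref_id in candidates]
        # Diacritic-sensitive: keep only entries whose stored spelling matches.
        target = unicodedata.normalize("NFC", word)
        return [ref_id for stored, ref_id in candidates
                if unicodedata.normalize("NFC", stored) == target]


index = FoldedIndex()
index.add("M\u00e9xico", 1)  # "México"
index.add("Mexico", 2)
index.add("Per\u00fa", 3)    # "Perú"

blind = index.find("Mexico")                   # both spellings
exact = index.find("M\u00e9xico", exact=True)  # accented spelling only
```

The sensitive lookup costs one extra comparison per candidate, but there is no second index and so no doubling of the database size.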

Jon
Sonny Software

Post by **shuyi** » Wed Oct 24, 2007 9:45 am

I have a related question, though perhaps this should be in a new thread or considered a tech support issue...

I use Chinese and Japanese source material quite frequently and have a number of these sources in Bookends. I've noticed that I am unable to consistently perform a search when inputting CJK text via the search field in the main window. Example: if I attempt to input 'yi' via the Simplified Chinese Pinyin input method in order to get a keyword that is a particular Chinese character, the search will assume I want to search on the Roman characters 'yi' and will not give me a chance to select the proper Chinese character. Thus, I cannot perform the search. If I try a similar search using a traditional Chinese input method or a Japanese input method, I can sometimes get the search to work as expected, but only rarely, especially in the case of traditional Chinese (it often defaults to the first character in the list that pops up after inputting, say, 'hua'; I have no chance to choose the character in question before the search is performed by Bookends).

Relatedly, when attempting to switch input methods when the cursor is in the search field in the main window, the input source will often change on its own. So if I choose, say, the simplified Chinese input method, as I start to type it will switch to a Japanese input method of its own accord. This makes it difficult to attempt any search via the search field. Perhaps this is the reason I often can't perform CJK searches?

Neither of these is a problem, however, when searching a database via Refs menu -> Find... It only happens in the search field in the main window.

So, my question is, is this a bug in Bookends? I haven't had this problem in other apps nor when performing searches via Spotlight. Just curious.

Again, if this should be posted in another thread or if this is a tech support issue, my apologies for posting it here. If you have any question or need clarification, I'd be happy to discuss this further.

Thanks,
Chris

Post by **Jon** » Wed Oct 24, 2007 9:53 am

Hi,

Yes, this should be a different thread. Perhaps you can start a new one (I can't do that for you)? I will simply point out here that the live search and the Find use the same indices, so there should be no problem there. The live search just updates in real time (which is why it "grabs" the first character). Bookends doesn't know about the input method (that's at the OS level). Perhaps someone else who deals with Asian characters in Bookends has something more useful to add.

Jon
Sonny Software

Post by **shuyi** » Wed Oct 24, 2007 10:00 am

Hi Jon,

Thanks for the info. I figured it didn't have anything to do with the input method, but I thought I'd ask. As for the rest, I'll move this to a new thread.

Thanks,
Chris

Would you like your search to be sensitive to diacritics?