Friday, May 28, 2021

processing middle names correctly is my middle name

The Tears Of The Giraffe post reminded me of a thing I mentioned on Twitter a while back: the difficulty of accurately categorising (in terms of alphabetical sorting) people whose authorial name is in three bits, like, for instance, Alexander McCall Smith, or, for that matter, Iain Duncan Smith, who actually is - no, seriously - a published novelist.

Even in that tweet I have made an unwarranted assumption, which is that if you have someone who professionally goes by three names with no intervening hyphens, names two and three are a sort of combined surname. This is demonstrably untrue for names such as Joyce Carol Oates, double featuree on this very blog. To be fair, if you read the thread under the original tweet, I do acknowledge as much there. Below is a not-necessarily-exhaustive list of potentially problematic authorial names from my own bookshelves.
Author name | Middle bit | What is it?
Alice Thomas Ellis | Thomas | probably surname (pseudonym)
Isaac Bashevis Singer | Bashevis | probably surname (partial pseudonym)
Mario Vargas Llosa | Vargas | Spanish paternal surname
John Kennedy Toole | Kennedy | middle name
Gabriel García Márquez | García | Spanish paternal surname
F. Scott Fitzgerald | Scott | middle name
Alexander McCall Smith | McCall | surname
Joyce Carol Oates | Carol | middle name
Lewis Grassic Gibbon | Grassic | probably surname (pseudonym)
Ruth Prawer Jhabvala | Prawer | surname
David Foster Wallace | Foster | middle name
Bobbie Ann Mason | Ann | middle name
Michael Marshall Smith | Marshall | surname
Bret Easton Ellis | Easton | middle name
M. John Harrison | John | middle name

It's important to point out that these are my best guesses based on some cursory scanning of Wikipedia pages for the authors in question and the application of what seems to me like common sense but could very possibly not be. So for Alice Thomas Ellis, for instance, there isn't much real-world context to go on, since her real name was either Ann Margaret Haycraft née Lindholm or Anna Margaret Haycraft née Lindholm, depending on whether you believe her Wikipedia page or her obituary. But since Thomas is an unlikely middle name for a woman I have chosen to assume it's meant to be the first part of a two-part surname. 

Since I am quite lazy I wanted to have a single rule I could apply to all authors, and so the only rational one seemed to be to file the books by the third part of the name. This avoids things which seem obviously wrong, like filing Joyce Carol Oates under "C", but does produce some results which are wrong the other way, like filing Gabriel García Márquez under "M" rather than "G" and Alexander McCall Smith under "S" rather than "M". But it's the best system I have.
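For the programmatically inclined, the file-by-the-third-bit rule is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described above: the function names `shelf_letter` and `filing_key` are made up for the purpose, and it assumes the bits of a name are separated by whitespace.

```python
def shelf_letter(author: str) -> str:
    """Return the letter an author is filed under: the initial of the last name part."""
    parts = author.split()
    return parts[-1][0].upper()


def filing_key(author: str) -> tuple[str, str]:
    """Sort key: last name part first, then the full name as a tie-breaker."""
    parts = author.split()
    return (parts[-1].casefold(), author.casefold())


authors = [
    "Joyce Carol Oates",
    "Gabriel García Márquez",
    "Alexander McCall Smith",
]

# Sort the shelf using the single "third part of the name" rule.
shelf = sorted(authors, key=filing_key)

for author in shelf:
    print(shelf_letter(author), author)
```

The rule gets Joyce Carol Oates right and, as noted, files García Márquez and McCall Smith under the "wrong" letter; fixing that would need per-author knowledge that no single rule can supply.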

These relatively minor problems are just a microcosm of the problems faced by anyone trying to categorise people's names, particularly those trying to design IT systems and databases that store and display them. This article has an excellent list of the Everything You Think You Know Is Wrong variety, and this article builds on it with some real-world examples (some more can be found here). For all that these are excellent things to remember, especially if your book collection contains anything by Colette or Voltaire, they don't necessarily mean that you should abandon the approach of storing "given name" and "family name" as two separate fields, as this is phenomenally useful for the vast majority of Western names. They just mean that you need to design in enough flexibility to store what you might, in your blinkered Western-culture-centric way, consider to be "non-standard" names.
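One way that flexibility might look, as a rough sketch rather than a recommendation: keep the name exactly as printed as the only required field, treat the given/family split as optional, and allow an explicit filing override. The class `AuthorName` and its field names are my own invention for illustration, not taken from either article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorName:
    display_name: str                   # the name exactly as it appears on the cover
    given_name: Optional[str] = None    # optional: not every name splits this way
    family_name: Optional[str] = None   # optional: may itself contain several words
    sort_as: Optional[str] = None       # optional explicit override for filing

    def sort_key(self) -> str:
        # Prefer the explicit override, then the family name, then the whole name.
        key = self.sort_as or self.family_name or self.display_name
        return key.casefold()


authors = [
    AuthorName("Voltaire"),
    AuthorName("Joyce Carol Oates", "Joyce Carol", "Oates"),
    AuthorName("Alexander McCall Smith", "Alexander", "McCall Smith"),
    AuthorName("Gabriel García Márquez", "Gabriel", "García Márquez"),
]

shelf = sorted(authors, key=AuthorName.sort_key)
print([a.display_name for a in shelf])
```

Here the single-name authors fall back to their whole name, while the two-part surnames are filed under their first part because whoever entered the data said so, not because the code guessed.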

And, I'd hope it goes without saying, while you're being all woke and accepting inputs from a gazillion different character sets, don't forget to sanitise your inputs.
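In practice "sanitising" usually means never pasting user input into an SQL string in the first place. Below is a minimal sketch using Python's built-in sqlite3 module and its `?` placeholders; the `authors` table and the `add_author` helper are hypothetical, purely for illustration.

```python
import sqlite3


def add_author(conn: sqlite3.Connection, display_name: str) -> None:
    # The ? placeholder passes the name as data, so it is never parsed as SQL.
    conn.execute("INSERT INTO authors (display_name) VALUES (?)", (display_name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (display_name TEXT NOT NULL)")

# One perfectly ordinary non-ASCII name and one deliberately hostile one.
for name in ["Gabriel García Márquez", "Robert'); DROP TABLE authors;--"]:
    add_author(conn, name)

rows = [row[0] for row in conn.execute("SELECT display_name FROM authors ORDER BY rowid")]
print(rows)
```

Both names end up stored verbatim as text, and the table is still there afterwards.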

