
Parsing Names with Honorifics

In Railscast #16, Ryan Bates goes over Virtual Attributes in Rails, using the standard example of storing first and last names but getting/setting full names. He uses the following simple snippet:


def full_name=(name)
  split = name.split(' ', 2)
  self.first_name = split.first
  self.last_name = split.last
end

Given that the focus was on virtual attributes, that is fine for explanation. However, the snippet splits only at the first space, so it fails on names like "Franklin Delano Roosevelt" (it yields a last name of "Delano Roosevelt"). Here's a method that our 32nd President will like better:


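# Keep letters, hyphens and commas; collapse any run of anything else
# (whitespace, punctuation, digits) into a single space.
def clean(n, re = /[^[:alpha:]\-,]+/)
  n.gsub(re, ' ').strip
end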

# Returns [first_name, last_name]. first_name is '' when there is at most
# one word; last_name is nil when the input is blank.
# Leading/trailing spaces ignored.
def first_last_from_name(n) 
    parts    = clean(n).split(' ')
    [parts.slice(0..-2).join(' '), parts.last]
end

names = [
    "Bill! Merkin,PhD.",
    "Jim               Thurston Howell III   ",
    "Charo", 
    "Heywood Jablowmie",
    "Sergei Rodriguez-Ivanoviv",
    "Polly Romanesq. ",
    "   ", 
    "",
    ]
p names.map { |n| first_last_from_name n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], ["", "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], ["", nil], ["", nil]]
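To make the difference from the Railscast snippet concrete, here is a small side-by-side sketch. The method names `railscast_split` and `last_token_split` are made up for this illustration; the second one just inlines the `slice(0..-2)` trick from above and skips the cleaning step.

```ruby
# Hypothetical helper names, for illustration only.

# The Railscast approach: split at the first space only.
def railscast_split(name)
  name.split(' ', 2)
end

# The approach above: everything but the last token is the first name.
def last_token_split(name)
  parts = name.split(' ')
  [parts.slice(0..-2).join(' '), parts.last]
end

fdr = "Franklin Delano Roosevelt"
p railscast_split(fdr)   # => ["Franklin", "Delano Roosevelt"]
p last_token_split(fdr)  # => ["Franklin Delano", "Roosevelt"]
```

Both agree on two-word names; they only diverge once a middle name shows up.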

A regex is more extensible, and makes more sense for Perl refugees like me.


# Returns [first_name, last_name]. first_name is nil when there is only one
# word; last_name is '' when the input is blank.
# Leading/trailing spaces ignored.
def first_last_from_name_re(n)
    n = clean(n)
    (n =~ / /) ? n.scan(/(.*)\s+(\S+)$/).first : [nil, n]
end

p names.map { |n| first_last_from_name_re n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], [nil, "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], [nil, ""], [nil, ""]]

However, as someone who can't check in at the automatic kiosks in airports because -- no joke -- the credit card thinks my last name is "IV", I like this version better.


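# Returns [first_name, last_name, appendix]
# (first name and appendix are nil if there isn't any).
# Leading/trailing spaces ignored.
# Pass appendix_re (a regex source string with one capture group) to
# override the default list of suffixes.
def first_last_appendix_from_name_re(n, appendix_re = nil)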
    n = clean(n)
    appendix_re ||= %q((I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?))
    if (n !~ / /) then
        [nil, n, nil]           # with no spaces return n as last name
    else
        n.scan(
          /\A(.*?)\s+           # everything up to the last name
           (\S+?)               # last name is last stretch of non-whitespace
           (?:                  # But! there may be an appendix.  Look for an optional group
             (?:,\s*|\s+)       #   that is set off by a comma or spaces
             #{appendix_re}     #   and that matches any of our standard honorifics.
             )?                 # but if not, don't worry about it.
           \Z/ix).first         # scan gives array of arrays; \A..\Z guarantees exactly one match
    end
end

p names.map { |n| first_last_appendix_from_name_re n }
# => [["Bill", "Merkin", "PhD"], ["Jim Thurston", "Howell", "III"], [nil, "Charo", nil], ["Heywood", "Jablowmie", nil], ["Sergei", "Rodriguez-Ivanoviv", nil], ["Polly", "Romanesq", nil], [nil, "", nil], [nil, "", nil]]
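To tie this back to the Railscast, here is a rough sketch of how a three-part parser could sit behind a `full_name=` virtual attribute. It uses a plain `Struct` rather than an ActiveRecord model so that it runs on its own. The `Person` struct, its `suffix` field, and the `parse_name` helper are hypothetical names for this example; `parse_name` is a condensed restatement of the method above that inlines a cleaning step which keeps commas.

```ruby
# Hypothetical condensed version of the three-part parser.
# Returns [first_name, last_name, suffix]; missing parts are nil.
def parse_name(raw)
  # Keep letters, hyphens and commas; collapse everything else to one space.
  n = raw.gsub(/[^[:alpha:]\-,]+/, ' ').strip
  return [nil, n, nil] unless n.include?(' ')

  appendix = 'I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?'
  n.match(/\A(.*?)\s+(\S+?)(?:(?:,\s*|\s+)(#{appendix}))?\z/i).captures
end

# Stand-in for a model with three stored columns (assumed names).
Person = Struct.new(:first_name, :last_name, :suffix) do
  # Virtual attribute: assigning a full name fills the three stored fields.
  def full_name=(name)
    self.first_name, self.last_name, self.suffix = parse_name(name)
  end

  # Reassemble the stored fields into a display name.
  def full_name
    base = [first_name, last_name].compact.join(' ')
    suffix ? "#{base}, #{suffix}" : base
  end
end

person = Person.new
person.full_name = "Jim               Thurston Howell III   "
p person.to_a          # => ["Jim Thurston", "Howell", "III"]
puts person.full_name  # => Jim Thurston Howell, III
```

Storing the suffix in its own field is the design choice that matters here: it keeps "III" or "IV" out of the last name without throwing it away.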

All three versions assume the family name comes last, so they might make people from Japan (and other "FamilyName GivenNames" cultures) sad.
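One possible way to soften that, sketched here as an assumption rather than a fix: let the caller say which order the name is in. The `given_family_from_name` method and its `family_first` keyword are hypothetical, and the sketch only handles space-separated names, so it does nothing for names written without spaces.

```ruby
# Hypothetical sketch: the caller states the name order explicitly.
# Returns [given_names, family_name]; either may be nil if missing.
def given_family_from_name(name, family_first: false)
  parts = name.split(' ')
  return [nil, nil] if parts.empty?
  return [nil, parts.first] if parts.size == 1

  if family_first
    # First token is the family name; the rest are given names.
    [parts[1..-1].join(' '), parts.first]
  else
    # Last token is the family name, as in the versions above.
    [parts[0..-2].join(' '), parts.last]
  end
end

p given_family_from_name("Kurosawa Akira", family_first: true)
# => ["Akira", "Kurosawa"]
p given_family_from_name("Franklin Delano Roosevelt")
# => ["Franklin Delano", "Roosevelt"]
```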


Your "clean" method strips all German umlauts (ä, ö, ü, etc.). I would be pissed too, because the airport terminal would think my name is "Jrgen" instead of "Jürgen". Sorry dude, gotta make another revision.
