Vizsage: Parsing Names with Honorifics

In Railscast #16, Ryan Bates goes over Virtual Attributes in Rails, using the standard example of storing first and last names but getting/setting full names. He uses the following simple snippet:


def full_name=(name)
  split = name.split(' ', 2)
  self.first_name = split.first
  self.last_name = split.last
end

Which -- given that the focus was on virtual attributes -- is fine for explanation. However, that snippet splits on the first space only, so it fails on names like "Franklin Delano Roosevelt" (it yields a last name of "Delano Roosevelt"). Here's a method which our 32nd President will like better:


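# Replace each run of whitespace or punctuation (anything but letters,
# hyphens and commas) with a single space, then trim the ends.
# Commas are kept so an appendix like ",PhD" can still be recognized.
def clean(n, re = /[^[:alpha:]\-,]+/)
  n.gsub(re, ' ').strip
end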

# Returns [first_name, last_name]. first_name is '' when there is only one
# word; last_name is nil for blank input.
# Leading/trailing spaces ignored.
def first_last_from_name(n) 
    parts    = clean(n).split(' ')
    [parts.slice(0..-2).join(' '), parts.last]
end

names = [
    "Bill! Merkin,PhD.",
    "Jim               Thurston Howell III   ",
    "Charo", 
    "Heywood Jablowmie",
    "Sergei Rodriguez-Ivanoviv",
    "Polly Romanesq. ",
    "   ", 
    "",
    ]
p names.map { |n| first_last_from_name n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], ["", "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], ["", nil], ["", nil]]
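To tie this back to the virtual attribute, here is a minimal sketch of how the same split could sit behind a full_name= setter. Person is a hypothetical plain Ruby class with attr_accessors standing in for an ActiveRecord model, the full_name getter is my own addition, and the cleanup step is skipped for brevity.

```ruby
# Hypothetical stand-in for an ActiveRecord model: plain attr_accessors
# instead of database columns.
class Person
  attr_accessor :first_name, :last_name

  # Virtual attribute setter: everything before the final word becomes the
  # first name, the final word becomes the last name.
  def full_name=(name)
    parts = name.to_s.split(' ')
    self.first_name = parts[0...-1].join(' ')
    self.last_name  = parts.last
  end

  # Virtual attribute getter: rejoin whichever parts are present.
  def full_name
    [first_name, last_name].compact.reject(&:empty?).join(' ')
  end
end

fdr = Person.new
fdr.full_name = "Franklin Delano Roosevelt"
p [fdr.first_name, fdr.last_name]
# => ["Franklin Delano", "Roosevelt"]
```

In a real model the two accessors would be columns, and the setter body would stay the same.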

A regex is more extensible, and makes more sense for Perl refugees like me.


# Returns [first_name, last_name]. first_name is nil when there is only one
# word; last_name is '' for blank input.
# Leading/trailing spaces ignored.
def first_last_from_name_re(n)
    n = clean(n)
    (n =~ / /) ? n.scan(/(.*)\s+(\S+)$/).first : [nil, n]
end

p names.map { |n| first_last_from_name_re n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], [nil, "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], [nil, ""], [nil, ""]]

However, as someone who can't check in at the automatic kiosks in airports because -- no joke -- the credit card thinks my last name is "IV", I like this version better.


# Returns [first_name, last_name, appendix]
# (first name and appendix are nil if there isn't any).
# Leading/trailing spaces ignored.
# Pass appendix_re (a regex source string with one capturing group) to
# override the default list of honorifics.
def first_last_appendix_from_name_re(n, appendix_re = nil)
    n = clean(n)
    appendix_re ||= %q((I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?))
    if (n !~ / /) then
        [nil, n, nil]           # with no spaces return n as last name
    else
        n.scan(
          /\A(.*?)\s+           # everything up to the last name
           (\S+?)               # last name is last stretch of non-whitespace
           (?:                  # But! there may be an appendix.  Look for an optional group
             (?:,\s*|\s+)       #   that is set off by a comma or spaces
             #{appendix_re}     #   and that matches any of our standard honorifics.
             )?                 # but if not, don't worry about it.
           \Z/ix).first         # scan gives array of arrays; \A..\Z guarantees exactly one match
    end
end

p names.map { |n| first_last_appendix_from_name_re n }
# => [["Bill", "Merkin", "PhD"], ["Jim Thurston", "Howell", "III"], [nil, "Charo", nil], ["Heywood", "Jablowmie", nil], ["Sergei", "Rodriguez-Ivanoviv", nil], ["Polly", "Romanesq", nil], [nil, "", nil], [nil, "", nil]]
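If the positional array gets awkward to feed into model attributes, the same pattern can be written with named captures. What follows is only a sketch of that variant: SUFFIX_RE, NAME_RE, tidy and parse_name are names I made up for illustration, and tidy assumes a cleanup step that keeps letters, hyphens and commas.

```ruby
# Hypothetical names throughout; the suffix list mirrors the one in the post.
SUFFIX_RE = /I|II|III|IV|(?:jr|sr|m\.?d|esq|ph\.?d)\.?/i

NAME_RE = /\A
  (?<first>.*?)\s+                          # given names, as few words as possible
  (?<last>\S+?)                             # last name: one run of non-whitespace
  (?:(?:,\s*|\s+)(?<suffix>#{SUFFIX_RE}))?  # optional suffix after a comma or spaces
\z/x

# Assumed cleanup: turn every run of characters other than letters, hyphens
# and commas into a single space, then trim the ends.
def tidy(name)
  name.gsub(/[^[:alpha:]\-,]+/, ' ').strip
end

# Returns a hash with :first, :last and :suffix keys (nil where absent).
# A single word (or blank input) has no space to split on, so it falls
# through to the last name.
def parse_name(name)
  n = tidy(name)
  m = NAME_RE.match(n)
  return { first: nil, last: n, suffix: nil } unless m
  { first: m[:first], last: m[:last], suffix: m[:suffix] }
end

p parse_name("Thurston Howell IV").values_at(:first, :last, :suffix)
# => ["Thurston", "Howell", "IV"]
```

The hash keys line up with attribute names, so a setter could assign them one by one without caring about array positions.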

All three versions assume the family name comes last, so they might make Japanese users (and those from other "FamilyName GivenNames" cultures) sad.

Labels: appendix, attributes, honorific, jr, match, MD, name, parse, rails, regex, ruby, sr, virtual, whitespace

Posted by flip on Sunday, January 27, 2008 at 06:34:00 PM
