Extract state abbreviation and zip code from strings

Question

I want to extract the state abbreviation (2 letters) and the zip code (either 4 or 5 digits) from the following string:

    address <- "19800 Eagle River Road, Eagle River AK 99577
              907-481-1670
              230 Colonial Promenade Pkwy, Alabaster AL 35007
              205-620-0360
              360 Connecticut Avenue, Norwalk CT 06854
              860-409-0404
              2080 S Lincoln, Jerome ID 83338
              208-324-4333
              20175 Civic Center Dr, Augusta ME 4330
              207-623-8223
              830 Harvest Ln, Williston VT 5495
              802-878-5233
              "

For the zip code, I tried a few methods that I found here, but they didn't work, mainly because of the 5-digit street numbers and the zip codes that have only 4 digits:

    text <- readLines(textConnection(address))

    library(stringi)
    zip <- stri_extract_last_regex(text, "\\d{5}")
    zip

    library(qdapRegex)
    rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
    zip <- rm_zip3(text)
    zip

    [1] "99577" "1670"  "35007" "0360"  "06854" "0404"  "83338" "4333"  "4330"  "8223"  "5495"  "5233"  NA

For the state abbreviation, I have no idea how to extract it.

Any help is appreciated! Thanks in advance!

Edit 1: Include phone numbers

Do you want to do it programmatically or just by using a regex? It can also be done in Notepad++. — Rahul, May 4 '17 at 17:31
Thank you @Rahul. Both would be great. At least can you show me how to do it with Notepad++? — IloveCatRPython, May 4 '17 at 17:33
@WiktorStribiżew: Thanks! I still got error with the last 2 lines "AK" "AL" "CT" "ID" NA NA — IloveCatRPython, May 4 '17 at 17:35
@IloveCatandPython Modify the {5} to {4,5} like this: states <- str_extract(text, "\\b[A-Z]+(?=\\s+\\d{4,5}$)") — degant, May 4 '17 at 17:39

degant · Accepted Answer · 2017-05-04 18:25:56Z

Code to extract zip code:

library(stringr)
# Anchor at the end of the line so a 5-digit street number is not picked up
zip <- str_extract(text, "\\b\\d{5}$")

Code to extract state code:

states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")

Code to extract phone numbers:

phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")

NOTE: There appears to be an issue with your data: the last two zip codes should be 5 digits long, not 4. For example, 4330 should actually be 04330. If you don't control the data source but know for sure that these are US zip codes, you could left-pad them with zeros as required. However, since you are looking for a solution that handles 4 or 5 digits, you can use this:

Code to extract zip code (looks for a space in front and a newline or the end of the string at the back, so that parts of a phone number or a street address aren't picked up):

zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")

Code to extract state code:

states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")

Demo: https://regex101.com/r/7Im0Mu/2
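
As a rough illustration of the left-padding mentioned in the note, here is a minimal base R sketch. The helper name `pad_zip` is hypothetical (it is not part of the answer), and the sketch assumes every non-missing value is a string of digits that should be read as a US zip code.

```r
# Hypothetical helper (not from the answer): left-pad zip codes to 5 digits.
pad_zip <- function(zip) {
  # as.integer() drops any existing leading zeros; "%05d" puts them back.
  out <- sprintf("%05d", as.integer(zip))
  # Keep NA for lines that had no zip code instead of a formatted "NA".
  out[is.na(zip)] <- NA_character_
  out
}

zip <- c("99577", "06854", "4330", "5495", NA)
padded <- pad_zip(zip)
print(padded)
```

Values that already have 5 digits pass through unchanged, so the helper can be applied to the whole vector returned by the extraction.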

Thanks! I updated the address to include phone numbers as well. Can you modify your code accordingly? Also, looks like the state didn't work as intended [1] "AK" NA "AL" NA "CT" NA "ID" NA "ME" NA "VT" NA NA — IloveCatRPython, May 4 '17 at 17:59
Added code for phone number. I think you are getting extra NA every alternate line because now your phone number is on another line and str_extract tries to extract zip code for each line and isn't able to find any ZIP code on the 2nd line containing phone number — degant, May 4 '17 at 18:12
Sorry for not being clear! With the added phone numbers, the last 4 digits in them will also show up after extraction [1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA — IloveCatRPython, May 4 '17 at 18:22
Sorry I thought I fixed that. I updated the code and demo link now, take a look. You'll also get the same problem here like state codes though, each alternate one will be NA due to the number of lines in your text — degant, May 4 '17 at 18:27
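
The alternating NA values discussed in these comments come from running each pattern over every line, including lines of the other kind. One way to handle them, sketched here in base R so it runs without extra packages, is to extract per line and then drop the NAs. The helper `extract_first` is hypothetical and only stands in for `str_extract`; the sample lines are a shortened copy of the question's data.

```r
# Shortened sample, shaped like the output of readLines() on the question's string.
text <- c("19800 Eagle River Road, Eagle River AK 99577",
          "              907-481-1670",
          "              20175 Civic Center Dr, Augusta ME 4330",
          "              207-623-8223",
          "              ")

# Hypothetical stand-in for str_extract(): first match per line, NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

zip    <- extract_first(text, "(?<= )\\d{4,5}(?=\\n|$)")
states <- extract_first(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
phone  <- extract_first(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")

# Address lines yield NA for phone and phone lines yield NA for zip/state,
# so dropping the NAs leaves one value per record.
zip    <- zip[!is.na(zip)]
states <- states[!is.na(states)]
phone  <- phone[!is.na(phone)]
```

This relies on every record having exactly one address line and one phone line; if a record is missing either, the three vectors would no longer line up.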

PKumar · Accepted Answer · 2017-05-04 19:05:02Z

I am using address as input, not text; see if it works for your case.

Assumptions behind the regex: two capital letters followed by 4 or 5 digits are the state and zip code, and the phone number is always on the next line.

Input:

address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"

I am using the stringr library; you may choose any other to extract the information as you wish.

library(stringr)
# Match each "STATE ZIP<newline>PHONE" record, then split it on whitespace
matches <- str_extract_all(address,"[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]]
df <- data.frame(do.call("rbind",strsplit(matches,split="\\s|\\n")))
names(df) <- c("state","Zip","Phone")

EDIT:

In case someone wants to use text as input:

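text <- readLines(textConnection(address))
text <- data.frame(text, stringsAsFactors = FALSE)
# One-column results: "" wherever a line has no match
st_zip <- setNames(data.frame(str_extract_all(text$text,"[A-Z][A-Z]\\s\\d{4,5}",simplify = TRUE)),"St_zip")
pin <- setNames(data.frame(str_extract_all(text$text,"\\d{3}-\\d{3}-\\d{4}",simplify = TRUE)),"pin")
# Drop the empty rows; subsetting a one-column data frame returns a vector
st_zip <- st_zip[st_zip$St_zip != "",]
# Split "AK 99577" into State and Zip (as.character guards against factor columns)
df1 <- setNames(data.frame(do.call("rbind",strsplit(as.character(st_zip),split=' '))),c("State","Zip"))
pin <- pin[pin$pin != "",]
df2 <- data.frame(cbind(df1,pin))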

OUTPUT:

  State   Zip          pin
1    AK 99577 907-481-1670
2    AL 35007 205-620-0360
3    CT 06854 860-409-0404
4    ID 83338 208-324-4333
5    ME  4330 207-623-8223
6    VT  5495 802-878-5233
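
For lines read from a file, the same idea can be sketched in base R by joining the lines back into one string, so that the pattern can span the line break. This is only an illustration under stated assumptions: the file name in the comment is hypothetical, the sample lines are a shortened copy of the question's data, and `\\s+` is used in place of the single `\\s` above so that indentation before the phone number is tolerated.

```r
# Sample lines; with a real file this could be readLines("addresses.txt"),
# where the file name is hypothetical.
lines <- c("19800 Eagle River Road, Eagle River AK 99577",
           "              907-481-1670",
           "              20175 Civic Center Dr, Augusta ME 4330",
           "              207-623-8223",
           "              ")

# Rebuild a single string so one match can cover "STATE ZIP" plus the phone line.
address <- paste(lines, collapse = "\n")

# Capture groups: state, zip, phone. \\s+ also absorbs newline and indentation.
pattern <- "([A-Z]{2})\\s+(\\d{4,5})\\s+(\\d{3}-\\d{3}-\\d{4})"

# All full records in the string.
records <- regmatches(address, gregexpr(pattern, address, perl = TRUE))[[1]]

# Re-match each record to pull out the three groups (element 1 is the full match).
parts <- regmatches(records, regexec(pattern, records, perl = TRUE))
df <- as.data.frame(do.call(rbind, lapply(parts, `[`, -1)),
                    stringsAsFactors = FALSE)
names(df) <- c("State", "Zip", "Phone")
print(df)
```

Because the groups are captured directly, no later split on whitespace is needed.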

Thank you! It works now. Is it possible to modify your code to work with reading text from file or at least work with text instead of address? — IloveCatRPython, May 4 '17 at 18:40
Thanks @K..pradeeep! Much appreciated! If only I could choose more than one answer. — IloveCatRPython, May 4 '17 at 22:23

Rahul · Accepted Answer · 2017-05-04 18:16:15Z

Thank you @Rahul. Both would be great. At least can you show me how to do it with Notepad++?

Extraction using Notepad++

1. First, copy your whole data into a file and open it in Notepad++.
2. Press Ctrl + F to open the search dialog and choose the Replace tab. Set Search Mode to Regular expression, search for ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This finds each state abbreviation and ZIP code and puts it on a new line with - as prefix and suffix.
3. Go to the Mark tab. Check the Bookmark Line checkbox, search for -(.*?)- and press Mark All. This marks the lines that hold the state abbreviation and ZIP code, which are now wrapped in -.
4. Go to Search --> Bookmark --> Remove Unmarked Lines.
5. Finally, search for ^-|-$ and replace with an empty string.

Update

So now there will be phone numbers too? In that case you only have to remove the $ from the regex in step 2, so the regex to use is ([A-Z]{2}\s*\d{4,5}). All the other steps stay the same.
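
To check what these steps should produce, the same search-and-replace sequence can be emulated in R. This is a sketch of the idea, not a description of Notepad++ internals. One deliberate difference, which is an assumption about the intent: the marking step here uses the anchored pattern `^-.*-$` instead of `-(.*?)-`, because the unanchored pattern also matches inside a phone number such as 907-481-1670.

```r
# Shortened copy of the question's data as one string.
txt <- paste(c("19800 Eagle River Road, Eagle River AK 99577",
               "907-481-1670",
               "20175 Civic Center Dr, Augusta ME 4330",
               "207-623-8223"), collapse = "\n")

# Replace step: wrap each "STATE ZIP" in dashes on a line of its own.
wrapped <- gsub("([A-Z]{2}\\s*\\d{4,5})", "\n-\\1-\n", txt, perl = TRUE)

# Mark and remove steps: keep only lines that are entirely of the form -...-.
lines <- strsplit(wrapped, "\n", fixed = TRUE)[[1]]
kept <- lines[grepl("^-.*-$", lines)]

# Final step: strip the leading and trailing dash.
result <- gsub("^-|-$", "", kept)
print(result)

# The unanchored pattern from the mark step also hits a phone line:
phone_is_marked <- grepl("-(.*?)-", "907-481-1670", perl = TRUE)
```

If phone lines survive the Remove Unmarked Lines step in Notepad++, anchoring the mark pattern in the same way may be worth trying.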

Thank you @Rahul! It didn't work for me link (please note that I updated the address to include phone numbers as well) — IloveCatRPython, May 4 '17 at 18:03
@IloveCatandPython: So now there will be phone numbers too ? In that case you only have to remove $ from regex in step 2. Regex to use will be ([A-Z]{2}\s*\d{4,5}). Rest all steps will be same. — Rahul, May 4 '17 at 18:05
Notepad++ kept telling me Find: Can't find the text "([A-Z]{2}\s*\d{4,5})". Do I need to install any plugin for it? TY — IloveCatRPython, May 4 '17 at 18:09
You have to search it in Replace tab and do the replacement accordingly. Please read the steps. — Rahul, May 4 '17 at 18:10
Thanks! I didn't realize that you have to choose Regular expression in Search Mode — IloveCatRPython, May 4 '17 at 18:16
