
I want to extract the state abbreviation (2 letters) and the zip code (either 4 or 5 digits) from the following string:

    address <- "19800 Eagle River Road, Eagle River AK 99577
              907-481-1670
              230 Colonial Promenade Pkwy, Alabaster AL 35007
              205-620-0360
              360 Connecticut Avenue, Norwalk CT 06854
              860-409-0404
              2080 S Lincoln, Jerome ID 83338
              208-324-4333
              20175 Civic Center Dr, Augusta ME 4330
              207-623-8223
              830 Harvest Ln, Williston VT 5495
              802-878-5233
              "

For the zip code, I tried a few methods that I found on here, but they didn't work, mainly because of the 5-digit street numbers and the zip codes that have only 4 digits.

    text <- readLines(textConnection(address))

    library(stringi)
    zip <- stri_extract_last_regex(text, "\\d{5}")
    zip

    library(qdapRegex)
    rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
    zip <- rm_zip3(text)
    zip

    [1] "99577" "1670"  "35007" "0360"  "06854" "0404"  "83338" "4333"  "4330"  "8223"  "5495"  "5233"  NA 

For the state abbreviation, I have no idea how to extract it.

Any help is appreciated! Thanks in advance!

Edit 1: Include phone numbers


Code to extract zip code:

library(stringr)
zip <- str_extract(text, "\\d{5}$")  # anchored at the line end so a 5-digit street number is not matched

Code to extract state code:

states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")

Code to extract phone numbers:

phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")

NOTE: There seems to be an issue with your data: the last two zip codes should be 5 characters long, not 4. For example, 4330 should actually be 04330. If you don't control the data source but know for sure that these are US zip codes, you could left-pad them with 0s as required. However, since you are looking for a solution that handles 4 or 5 digits, you can use this:
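As a rough illustration of the padding idea mentioned in the note, the sketch below uses `str_pad` from stringr. The `zip` vector here is a made-up sample (an assumption for illustration), not output from the code in this answer.

```r
library(stringr)

# Hypothetical sample: zip codes already extracted as character strings,
# some of them missing their leading zero.
zip <- c("99577", "4330", "5495", NA)

# Left-pad every value to width 5 with "0"; NA values stay NA.
zip_padded <- str_pad(zip, width = 5, side = "left", pad = "0")
zip_padded
```

Padding only makes sense if the values are kept as character strings; converting them to numbers would drop the leading zero again.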

Code to extract the zip code (requires a space before it and a newline or the end of the string after it, so that parts of a phone number or a street address aren't picked up):

zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")

Code to extract state code:

states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")

Demo: https://regex101.com/r/7Im0Mu/2
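Because `str_extract` returns one value per input line, lines without a match come back as `NA`. One possible way to deal with that is sketched below: run the three patterns from this answer and drop the `NA` entries before combining. The shortened `address` sample, the `str_trim` step, and the `result` name are assumptions made for illustration, and the sketch relies on every address having exactly one state, zip code, and phone number.

```r
library(stringr)

# Shortened sample in the same layout as the question (assumed for illustration).
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
20175 Civic Center Dr, Augusta ME 4330
207-623-8223"

# One element per line, with surrounding whitespace removed.
text <- str_trim(strsplit(address, "\n")[[1]])

zip    <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
phone  <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")

# Each vector holds NA for the lines that did not match,
# so dropping the NAs leaves one value per address.
result <- data.frame(
  state = states[!is.na(states)],
  zip   = zip[!is.na(zip)],
  phone = phone[!is.na(phone)],
  stringsAsFactors = FALSE
)
result
```

If some addresses could be missing a phone number or zip code, the three vectors would no longer line up and a per-record approach would be needed instead.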

  • Thanks! I updated the address to include phone numbers as well. Can you modify your code accordingly? Also, looks like the state didn't work as intended [1] "AK" NA "AL" NA "CT" NA "ID" NA "ME" NA "VT" NA NA – IloveCatRPython May 4 '17 at 17:59
  • Added code for phone number. I think you are getting extra NA every alternate line because now your phone number is on another line and str_extract tries to extract zip code for each line and isn't able to find any ZIP code on the 2nd line containing phone number – degant May 4 '17 at 18:12
  • Sorry for not being clear! With the added phone numbers, the last 4 digits in them will also show up after extraction [1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA – IloveCatRPython May 4 '17 at 18:22
  • Sorry I thought I fixed that. I updated the code and demo link now, take a look. You'll also get the same problem here like state codes though, each alternate one will be NA due to the number of lines in your text – degant May 4 '17 at 18:27

I am using address as the input rather than text; see if it works for your case.

Assumptions behind the regex: two capital letters followed by 4 or 5 digits are the state and zip code, and the phone number is always on the next line.

Input:

address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"

I am using the stringr library; you may choose any other library to extract the information.

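library(stringr)
# Each match looks like "AK 99577\n907-481-1670"; \\s also matches the newline.
matches <- str_extract_all(address, "[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]]
# Split every match on whitespace into the state, zip, and phone columns.
df <- data.frame(do.call("rbind", strsplit(matches, split = "\\s|\\n")))
names(df) <- c("state", "Zip", "Phone")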

EDIT:

In case someone wants to use text as the input:

text <- readLines(textConnection(address))
text <- data.frame(text, stringsAsFactors = FALSE)
# stringsAsFactors = FALSE keeps the matches as character so strsplit() works on them
st_zip <- setNames(data.frame(str_extract_all(text$text, "[A-Z][A-Z]\\s\\d{4,5}", simplify = TRUE), stringsAsFactors = FALSE), "St_zip")
pin <- setNames(data.frame(str_extract_all(text$text, "\\d{3}-\\d{3}-\\d{4}", simplify = TRUE), stringsAsFactors = FALSE), "pin")
st_zip <- st_zip[st_zip$St_zip != "", ]
df1 <- setNames(data.frame(do.call("rbind", strsplit(st_zip, split = " "))), c("State", "Zip"))
pin <- pin[pin$pin != "", ]
df2 <- data.frame(cbind(df1, pin))

OUTPUT:

  State   Zip          pin
1    AK 99577 907-481-1670
2    AL 35007 205-620-0360
3    CT 06854 860-409-0404
4    ID 83338 208-324-4333
5    ME  4330 207-623-8223
6    VT  5495 802-878-5233
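
A possible variation on the same idea, offered only as a sketch: `str_match_all` with capture groups returns the state, zip code, and phone number as separate matrix columns, so no second splitting step is needed. The shortened `address` sample and the names `pattern`, `m`, and `df3` are assumptions for illustration. Using `\\s+` instead of `\\s` also tolerates indented continuation lines.

```r
library(stringr)

# Shortened sample in the same layout as the input above (assumed for illustration).
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
830 Harvest Ln, Williston VT 5495
802-878-5233
"

# Three capture groups: state, zip code, phone number.
pattern <- "([A-Z]{2})\\s+(\\d{4,5})\\s+(\\d{3}-\\d{3}-\\d{4})"

# Column 1 of the result is the full match; columns 2 to 4 are the groups.
m <- str_match_all(address, pattern)[[1]]

df3 <- data.frame(
  State = m[, 2],
  Zip   = m[, 3],
  Phone = m[, 4],
  stringsAsFactors = FALSE
)
df3
```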

Thank you @Rahul. Both would be great. At least can you show me how to do it with Notepad++?


Extraction using Notepad++

  1. First, copy your whole data into a file.

  2. Open the search dialog with Ctrl + F and choose the Replace tab. Set Search Mode to Regular expression, search for ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This finds each state abbreviation and ZIP code and puts them on a new line with - as prefix and suffix.

  3. Now go to the Mark tab. Check the Bookmark Line checkbox, search for -(.*?)- and press Mark All. This marks the lines holding the state abbreviation and ZIP code, which are wrapped in -.

  4. Now go to Search --> Bookmark --> Remove Unmarked Lines.

  5. Finally, search for ^-|-$ and replace with an empty string.


Update

So now there will be phone numbers too? In that case you only have to remove the $ from the regex in step 2. The regex to use is ([A-Z]{2}\s*\d{4,5}). All other steps stay the same.
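
For readers who would rather stay in R than use Notepad++, the same pattern can be tried with base R, as in the sketch below. The shortened `address` sample and the names `hits` and `parts` are assumptions for illustration, and the backslashes are doubled because the pattern is written as an R string.

```r
# Shortened sample in the same layout as the question (assumed for illustration).
address <- "2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223"

# Same pattern as in the update: two capitals, optional whitespace, 4 or 5 digits.
hits <- regmatches(
  address,
  gregexpr("[A-Z]{2}\\s*\\d{4,5}", address, perl = TRUE)
)[[1]]

# Split each hit into its state and zip code parts (one row per hit).
parts <- do.call(rbind, strsplit(hits, "\\s+"))
state <- parts[, 1]
zip   <- parts[, 2]
```

Splitting on whitespace assumes the state and zip code are always separated by at least one space, as they are in the sample data.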

  • Thank you @Rahul! It didn't work for me link (please note that I updated the address to include phone numbers as well) – IloveCatRPython May 4 '17 at 18:03
  • @IloveCatandPython: So now there will be phone numbers too ? In that case you only have to remove $ from regex in step 2. Regex to use will be ([A-Z]{2}\s*\d{4,5}). Rest all steps will be same. – Rahul May 4 '17 at 18:05
  • Notepad++ kept telling me Find: Can't find the text "([A-Z]{2}\s*\d{4,5})". Do I need to install any plugin for it? TY – IloveCatRPython May 4 '17 at 18:09
  • You have to search it in Replace tab and do the replacement accordingly. Please read the steps. – Rahul May 4 '17 at 18:10
  • Thanks! I didn't realize that you have to choose Regular expression in Search Mode – IloveCatRPython May 4 '17 at 18:16
