Golang: Regex
Introduction
In this 26th part of the series, we will be covering the basics of using regular expressions in golang. This article will cover the basic operations like matching, finding, replacing, and sub-matches in a regular expression pattern from string source or file content. This will have examples for each of the concepts and similar variants will follow the same ideology in self-exploring the syntax.
Regex in golang
So, let’s start with what are regular expressions.
Regular expressions are basic building blocks for searching, pattern matching, and manipulation from a source of text using computational logic.
This is not the formal definition, but have written it in words for my understanding of regular expressions till now. Might not be accurate, but makes sense to me after I had played and explored it (not fully).
So regular expressions use some pattern-matching techniques using basic logic operators like concatenation, quantifiers, etc. These relate to the study of the theory of computation quite closely but you don’t need to get into too much theory in order to understand the working of regular expressions. However, it won’t harm you if you are curious about it and want to explore it further.
Some resources to learn the fundamentals of regular expressions:
Regexp package
We will be using the regexp package in the golang standard library to get some quick and important methods for quick and neat pattern matching and searching. It provides a Regexp
type and a lot of methods on top of it to perform matching, finding, replacing, and sub-matches in the source text.
This package also supports two types of methods serving different purposes and use cases for string and slice of bytes, this can be useful for reading from a buffer, file, etc., and also flexible enough to search for a simple string.
Matching Patterns
One of the fundamental aspects of the regular expression is to check if a particular pattern is present or not in a source string. The regexp
package provides some methods like Match, MatchString methods on a slice of bytes and string respectively from the pattern string.
Matching Strings
The basic operations with regex or regular expression can be performed to compare and match if the pattern matches a given string.
In golang, the regexp package provides a few functions to simply match expressions with strings or text. One of the easy-to-understand ones include the MtachString, and Match methods.
package main
import (
"log"
"regexp"
)
func log_error(err error) {
if err != nil {
log.Fatal(err)
}
}
func main() {
str := "gophers of the goland"
is_matching, err := regexp.MatchString("go", str)
log_error(err)
log.Println(is_matching)
is_matching, err := regexp.MatchString("no", str)
log_error(err)
log.Println(is_matching)
}
$ go run main.go
true
false
In the above code, we have used the MatchString
method that takes in two parameters string/pattern to find, and the source string. The function returns a boolean as true
or false
if the pattern is present in the source string, also it might return an error if the pattern(first parameter) is parsed as an incorrect regular expression.
So, we can clearly see, the string go
is present in the string gophers of the goland
and the string no
is not a substring.
We also have Match
method which is a more general version of MatchString
it excepts a slice of byte rather than a string as the source string. The first parameter is still a string, but the second parameter is a slice of bytes.
is_matching, err = regexp.Match(`.*land`, []byte("goland is a land of gophers"))
log_error(err)
log.Println(is_matching)
$ go run main.go
true
We can use the Match
method to simply parse a slice of bytes to use as the source text to check the pattern.
package main
import (
"log"
"os"
"regexp"
)
func log_error(err error) {
if err != nil {
log.Fatal(err)
}
}
func main() {
file_content, err := os.ReadFile("temp.txt")
log_error(err)
is_matching, err = regexp.Match(`memory`, file_content)
log_error(err)
log.Println(is_matching)
is_matching, err = regexp.Match(`text `, file_content)
log_error(err)
log.Println(is_matching)
}
# temp.txt
One of the gophers used a slice,
the other one used a arrays.
Some gophers were idle in the memory.
$ go run main.go
true
false
We can now even parse the contents of a file as a slice of bytes. So, it would be really nice to compare and check for a string pattern in a file quickly. Here in the above example, we have checked if memory
is present in the text which it is, and in the second call, we check if text
string is present anywhere in the file content which it is not.
Find Patterns
We can even use regular expressions for searching text or string with the struct/type Regexp provided by golang’s regexp. We can create a regular expression and use other functions like MatchString
, Match
, and others that we will explore to match or find a pattern in the text.
Find String from RegEx
We can get a slice of byte from the FindAll
method which will take in a slice of byte, the second parameter as -1 for all matches. The function returns a slice of slice of byte with the byte representation of the matched string in the source text.
exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)
match := exp.FindAll(pincode_file, -1)
log.Println(match)
# pincode.txt
Pincode: 12345-1234
City, 40084
State 123
$ go run main.go
[[49 50 51 52 53 45 49 50 51 52] [52 48 48 56 52]]
In the above example, we have used the Compile method to create a regular expression and FindAll to get all the occurrences of the matching patterns in the text. We have again read the contents from the file. In this example, the exp
is a regular expression for a postal code, which can either have 5 digit or 5digits-4digit combination. We read the file pincode.txt
as a slice of bytes and use the FindAll
method. The FindAll method takes in the parameter as a slice of byte and the integer as the number of occurrences to search. If we use a negative number it will include all the matches.
We search for the pin code in the file and the funciton returns a list of bytes that match the regular expression in the provided object exp
. Finally, we get the result as 12345-1234
and 40084
which are present in the file. It doesn’t match the number 123
which is not a valid match for the given regular expression.
There is also a version of FindAll
as FindAllString
which will take in a string as the text source and return a slice of strings.
matches := exp.FindAllString(string(pincode_file), -1)
log.Println(matches)
$ go run main.go
["12345-1234" "40084"]
So, the FindAllString
would return a slice of strings of the matches in the text.
Find the Index of String from RegEx
We can even get the start and end index of the matched string in the text using the FindIndex and FindAllIndex methods to get the indexes of the match, or all the matches of the file content.
exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)
match_index := exp.FindIndex(pincode_file)
log.Printf("%T\n", match_index)
log.Println(match_index)
$ go run main.go
[]int
[9 19]
The above code is using the FindIndex
method to get the indexes of the first match of the regular expression. The output of the code gives a single slice of an integer with length two as the start and last index of the matched string in the text file. So, here, the 9
represents the position(index) of the first character match in the string, and 19
is the last index of the matched string.
Pincode: 12345-1234
01234567890123456789
Assume 0 as 10 after 9, 11, and so on
It starts from the 9th character as `1` and ends at the `4` character
at position 18 but it returns the end position + 1 for the ease of slicing
City, 40084
012345678901
State 123
234567890
The character at 9 and 18 are the first and the last character position/index of the source string, so it returns the end position + 1 as the index. This makes retrieval of slicing source string much easier, as we won’t be offset by one.
If we want to get the text in the source string, we can use the slicing as:
exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)
match_index := exp.FindIndex(pincode_file)
if len(match_index) > 0 {
// Get the slice of the original string from start to end index
sliced_string := pincode_file[match_index[0]:match_index[1]]
log.Printf("%q\n", sliced_string)
}
$ go run main.go
"12345-1234"
So, we can access the string from the source text without calling other functions, simply slicing the original string will retrieve the expected results. This is the convention of the golang standard library to make the end index exclusive.
Similarly, the FindAllIndex
method can be used to get a list of list of such indexes of matched strings.
exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)
match_indexes := exp.FindAllIndex(pincode_file)
log.Printf("%T\n", match_indexes)
log.Println(match_indexes)
$ go run main.go
[][]int
[[9 19] [26 31]]
The above example gets the list of slices as for all the search pattern indexes in the source string/text. We can iterate over the list and get each matched string index.
Find Submatch
the regexp
package also has a utility function for finding the sub-match of a given regular expression. The method returns a list of strings (or slices of bytes) containing the leftmost match and the sub-matches in that match. This also has the All
version which instead of returning a single match i.e. the leftmost match it returns all the matches and the corresponding sub-matches.
package main
import (
"log"
"regexp"
)
func main() {
str := "abc@def.com is the mail address of 8th user with id 12"
exp := regexp.MustCompile(
`([a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,})` +
`|(email|mail)|` +
`(\d{1,3})`)`,
)
match := exp.FindStringSubmatch(str)
log.Println(match)
matches := exp.FindAllStringSubmatch(str, -1)
log.Println(matches)
}
$ go run main.go
[abc@def.com abc@def.com ]
[[abc@def.com abc@def.com ] [mail mail ] [8 8] [12 12]]
The above example uses a regex to compute a few things like email address, words like mail
or email
, also a number of up to 3 digits. The |
between these expressions indicates any combination of these three things would be matched for the regular expression. The FindStringSubmatch method, takes in a string as the source and returns a slice of the matching pattern. The first element is the leftmost match in the string source for the given regular expression, and the subsequent elements are the sub-matches in the matched string.
We can now move a little step ahead for actually understanding the sub-match in a regular expression.
str := "abe21@def.com is the mail address of 8th user with id 124"
exp := regexp.MustCompile(
`([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]*(\d*)[a-zA-Z]+\.[a-z|A-Z]{2,})` +
`|(mail|email)` +
`|(\d{1,3})`,
)
match := exp.FindStringSubmatch(str)
log.Println(match)
matches := exp.FindAllStringSubmatch(str, -1)
log.Println(matches)
$ go run main.go
[abe21@def.com abe21@def.com 21 ]
[[abe21@def.com abe21@def.com 21 ] [mail mail ] [8 8] [124 124]]
In the above example, there are a few things to take away, let us break it down into small pieces. We have a regex for matching either a mail address, the word mail
or email
, or a number up to 3 digits. The regex is a bit different from the previous example for understanding the sub-matches within an expression. We find the sub-matches in the string which will be handled by the FindStringSubmatch
. This function takes in a string and returns a list of strings that are the leftmost matches in the source string.
First, we need to understand the regex to get a clear idea of the code snippet. The first sub-match is for the email address. However, we use [a-zA-Z]
in the picking of the username and the domain name, we don’t want to directly match numbers yet. The goal of this regex is to pick up numbers inside an email address. So, we can have 1 or more characters [a-zA-z]+
, followed by 0 or more digits (\d*)
, and again we can have 0 or more characters [a-zA-Z]*
. The +
is for 1 or more, *
is for 0 or more, \d
is for digits. After this, we have the @
as a compulsory character in the mail, followed by the same sequence in the username i.e. 1 or more characters, 0 or more digits, 0 or more characters. Finally, we have the .com
and the domain name extensions, as a group of 2 or more characters [a-z|A-Z]{2,}
.
So, the regex accepts an email, with a sub-match of the number anywhere in the username or the domain name.
The FindStringSubmatch
function lists out the sub-matches for the leftmost(first) match of the regex in the source string. So it finds the string abc21@def.com
which is the email id. This regex ([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-z|A-Z]{2,})
has two sub-matches \d*
in the username part and also in the domain part. So, the list returns the email address as the match and the match as itself, as well as the number in the sub-match inside the mail address. So the result, [abc21@def.com abc21@def.com 21 ]
, there are a few empty string sub-matches because the second sub-match for the domain name number returns an empty string.
Similarly, the FindAllStringSubmatch
will return the list of all the matches in the source string. The other matches don’t have any sub-matches in the regular expression so it just gets the match and the sub-match as itself in the case of string mail
, digit 8
, and digit 124
.
We can also use this example from a file as a slice of bytes. This will return a list of slice of slice of bytes instead of slice of slice of strings.
package main
import (
"log"
"regexp"
)
func main() {
exp := regexp.MustCompile(
`([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]*(\d*)[a-zA-Z]+\.[a-z|A-Z]{2,})` +
`|(mail|email)` +
`|(\d{1,3})`,
)
email_file, err := os.ReadFile("subtext.txt")
log_error(err)
mail_match := exp.FindSubmatch(email_file)
log.Printf("%s\n", mail_match)
mail_matches := exp.FindAllSubmatch(email_file, -1)
//log.Println(mail_matches)
log.Printf("%s\n", mail_matches)
}
# submatch.txt
abc21@def.com is the mail address of user id 1234
The email address abe2def.com is of user name abc
a2be.@def.com
Email address: abe@de2f.com, User id: 45
johndoe@example.com
jane.doe123@example.com
janedoe@example.co.uk
john123@example.org
janedoe456@example.net
$ go run main.go
[][]unit8
[abc21@def.com abc21@def.com 21 ]
[][][]unit8
[
[abc21@def.com abc21@def.com 21 ] [mail mail ] [123 123] [4 4]
[email email ] [2 2] [2 2] [mail mail ]
[abe@de2f.com abe@de2f.com 2 ] [45 45]
[johndoe@example.com johndoe@example.com ]
[doe123@example.com doe123@example.com 123 ]
[janedoe@example.co janedoe@example.co ]
[john123@example.org john123@example.org 123 ]
[janedoe456@example.net janedoe456@example.net 456 ]
]
As we can see from the dummy email ids and some random text, we are able to match the email ids and the numbers in them as the sub-match of the string. This is a [][]unit8
in case of the FindSubmatch
and [][][]unit8
in case of FindAllSubmatch
. The working remains the same for the bytes as it was for the strings.
Find Submatch Index
We also have the FindSubmatchIndex and FindAllSubmatchIndex, and the string variants to get the index(es) of the sub-matches in the pattern picked from the regular expression.
str := "abe21@def.com is the mail address of 8th user with id 124"
exp := regexp.MustCompile(
`([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]*(\d*)[a-zA-Z]+\.[a-z|A-Z]{2,})` +
`|(mail|email)` +
`|(\d{1,3})`,
)
match := exp.FindStringSubmatch(str)
match_index := exp.FindStringSubmatchIndex(str)
log.Println(match)
log.Println(match_index)
$ go run main.go
[abe21@def.com abe21@def.com 21 ]
[0 13 0 13 3 5 9 9 -1 -1 -1 -1]
So this returns a list of pairs, here consider the list as [(0, 13) (0, 13), (3, 5), (9, 9), (-1, -1), (-1, -1)]
. Because the list represents the start and end indexes of the sub-matches in the source string. The match is the first pair, i.e. at the first character 0
-> a
, and ends at space i.e. 13
character off by one. Then we have the sub-match itself with the same indexes. Then the 3
and 5
indicating the number sub-match 21
in the abc21@def.com
it starts at 3 and ends at 5, so it occupies 3 and 4 characters in the source string. Similarly, the domain level number doesn’t have any number in the source string so it has returned the domain name end index as an empty string.
We had used (\d*)
which can return 0 or more occurrences of the digit, so it returned no match in the case of the domain level name, hence we get the 9
as an empty string at the end of the sub-match for it. The rest are for the email
or mail
, and the lone digit sub-matches in the regular expression which we don’t have in the first match in the source string.
str := "abe21@def.com is the mail address of 8th user with id 124"
exp := regexp.MustCompile(
`([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]*(\d*)[a-zA-Z]+\.[a-z|A-Z]{2,})` +
`|(mail|email)` +
`|(\d{1,3})`,
)
match := exp.FindAllStringSubmatch(str, -1)
match_indexes := exp.FindAllStringSubmatchIndex(str, -1)
log.Println(match_indexes)
$ go run main.go
[[abe21@def.com abe21@def.com 21 ] [mail mail ] [8 8] [124 124]]
[
[0 13 0 13 3 5 9 9 -1 -1 -1 -1]
[21 25 -1 -1 -1 -1 -1 -1 21 25 -1 -1]
[37 38 -1 -1 -1 -1 -1 -1 -1 -1 37 38]
[54 56 -1 -1 -1 -1 -1 -1 -1 -1 54 57]
]
Similarly, we have used the FindAllStringSubmatchIndex for getting the list of a slice of indexes(int) for the sub-match of the expression in the source string.
The first element is the same as the previous example, the next match in the source string is mail
which comes at index 21 and ends at the 24th index which golang does 24+1 as a convention. Similarly, the number 8
is matched at index 37
and the number 124
at index 54
, rest of the sub-matches for these matches are not present so it turns up to be -1 for the rest of the sub-matches.
So this can also be used for the byte/uint8 type of variants with FindSubmatchIndex and FindAllSubmatchIndex.
Replace Patterns
The replace method is used for replacing the matched patterns.
The ReplaceAll and ReplaceAllLiteral with some string and byte slice variations can help us in replacing the source text with a replacement string against a regular expression.
Let’s start with a simple example of strings.
package main
import (
"log"
"regexp"
)
func main() {
str := "Gophy gophers go in the golang grounds"
exp := regexp.MustCompile(`(Go|go)`)
replaced_str := exp.ReplaceAllString(str, "to")
log.Println(replaced_str)
}
The above code can replace the string with regex with the replacement string. The regex provided in the exp
variable has a pattern matching for Go
or go
. So, the ReplaceAllString
method takes in two arguments as strings and returns the string after replacing the character.
$ go run replace.go
Original: Gophy gophers go in the golang grounds
Replaced: tophy tophers to in the tolang grounds
So, this replaces all the characters with Go
or go
in the source string with to
.
There is a special value available in the replacement string which can expand the regular expression literals. Since the regular expression consists of multiple literals/quantifiers, we can use those to replace or keep the string in a particular expression’s evaluation.
str = "Gophy gophers go in the golang grounds"
exp2 := regexp.MustCompile(`(Go|go)|(phers)|(rounds)`)
log.Println(exp2.ReplaceAllString(str, "hop"))
log.Println(exp2.ReplaceAllString(str, "$1"))
log.Println(exp2.ReplaceAllString(str, "$2"))
log.Println(exp2.ReplaceAllString(str, "$3"))
$ go run replace.go
hopphy hophop hop in the hoplang ghop
Gophy go go in the golang g
phy phers in the lang g
phy in the lang grounds
The above code uses the regular expression match string have been replaced with the literals. So, the regex (Go|go)|(phers)|(rounds)
has three parts Go|go
, phers
, and rounds
. Each of the literals is separated by an operator, indicating either the match or all matches should be considered.
In the first statement, we replace the regex with hop
as you can see all the matches are replaced with the replacement string. For instance, the word gophers
is replaced by hophop
, because go
and phers
is matched separately and replaced.
In the second statement, we replace the source string with $1
indicating the first literal in the regex i.e. Go|go
. These statements will expand the $1
and keep the match only where the literal matches and rest are removed. So, Gophy
is matched with Go|go
so it is kept as is in the replacement. However, for grounds
, the literal match for $1
i.e. Go|go
does not match and hence is not kept and removed, so the resulting string becomes g
rest is substituted with an empty string.
In the third print statement, the source string is replaced with the second literal $2
i.e. phers
. So if any string matches the literal, only those are kept and the rest are substituted by an empty string. So, Gophy
or Go
doesn’t match phers
and hence is replaced, however, gophers
match the phers
part and is kept as it is, but the go
part is substituted.
Similarly, for the fourth print statement, the source string is replaced with the third literal i.e. rounds
. So if only the third literal is kept as is, the rest matching strings from the regex are substituted with an empty string. So, grounds
remain as it is because it matches the rounds
in the replacement literal.
In short, we replace the literal back after replacing the regex patterns in the source string. This can be used to fine-tune or access specific literals or expressions in the regular expressions.
str = "Gophy gophers go in the golang grounds"
exp2 := regexp.MustCompile(`(Go|go)|(phers)|(rounds)`)
log.Println(exp2.ReplaceAllString(str, "$1$2"))
str = "Gophy gophers go in the golang cophers grounds"
log.Println(exp2.ReplaceAllString(str, "$1$3"))
$ go run replace.go
Gophy gophers go in the golang g
Gophy go go in the golang co grounds
We can even concatenate and make some minor string literal adjustments to the replacement string. As we have done in the example, where both the $1$2
are used as the replacement string. The two literals combine to make a string with Go|go
and phers
. So, we can see the result, in the first statement, Gophy gophers go in the golang g
, all the character that have Go|go
or phers
are kept as it is(substituted the same string), the grounds
however does not match and hence are replaced with the empty string(because the capture group $1$2
does not match rounds
).
Similarly, for the third example, the $1$3
i.e. Go|go
or rounds
are matched with the source string. So, we the phers
in gophers
and cophers
does not match the capture group $1$3
and hence is replaced by an empty group(string). However, the Gophy
, golang
, and grounds
match the capture group and are replaced by that match string (which is the same string).
If we want to avoid expansion of the strings as we did in the previous example with $1
and other parameters, we can use the ReplaceAllLiteral or ReplaceAllLiteralString to parse the string as it is.
str := "Gophy gophers go in the golang cophers grounds"
exp2 := regexp.MustCompile(`(Go|go)|(phers)|(rounds)`)
log.Println(exp2.ReplaceAllLiteralString(str, "$1"))
$ go run replace.go
$1phy $1$1 $1 in the $1lang co$1 g$1
As we can see the $1
is not expanded and parsed as it is for replacing the pattern in the regular expression. The Go
is replaced with $1
to look like $1phy
, and similarly for the rest of the patterns.
That’s it from the 26th part of the series, all the source code for the examples are linked in the GitHub on the 100 days of Golang repository.
Conclusion
This article covered the fundamentals of using the regexp
package for working with regular expressions in golang. We explored the methods and Regexp
type in the package with various methods available through the type interface. By exploring the examples and simple snippets, various ways for pattern matching, finding, and replacing were walked over and found.
So, hopefully, the article might have found useful to you, if you have any queries, questions, feedback, or mistakes in the article, you can let me know in the discussion or on my social handles. Happy Coding :)