Using regex to match specific numbers of sub-directories in a URL can be very helpful for Google Analytics. When I configure a new Google Analytics view, I'll usually set up Content Grouping so we can see traffic by page type rather than just by individual page. Ideally there's a value in the data layer that we can use for this purpose; failing that, I look for certain keywords in the URL so I can use GA's "Group using rule definitions" functionality. For example, if all the blog pages are grouped into a sub-directory called /blog/, it's easy enough to add a rule definition like "Page contains /blog/". The same applies to GA's URL destination goal setup, which also accepts string matches.
Unfortunately, real-life scenarios are often not that clean. There are many cases where there's neither a data layer value nor specific keywords in the URL. In those cases there's another potential approach: count up the number of sub-directories and match on those with a regular expression (regex). For example, an e-commerce site may have URLs like www.site.com/clothing/jeans/low-rise-jeans-12345A/. In that case you could use logic like: three sub-directories = product details page, two sub-directories = sub-category page, one sub-directory = main category page.
This post will provide the regex for matching specific numbers of sub-directories in a URL path, for a few different cases.
VARIATION 1: EXACTLY X NUMBERS OF SUB-DIRECTORIES, WITH TRAILING SLASH
VARIATION 2: NO TRAILING SLASH
VARIATION 3: AT LEAST X NUMBER OF SUBDIRECTORIES
VARIATION 4: PATH SEGMENTS STARTING WITH A NUMBER
VARIATION 1: EXACTLY X NUMBERS OF SUB-DIRECTORIES, WITH TRAILING SLASH
This variation assumes every URL path ends in a trailing slash.
Regex for exactly one sub-directory
^/[^/]+/$
example matching URL path: /retail/
Regex for exactly two sub-directories
^/[^/]+/[^/]+/$
example matching URL path: /retail/clothing/
Regex for exactly three sub-directories
^/[^/]+/[^/]+/[^/]+/$
example matching URL path: /retail/clothing/jeans/
Regex for exactly four sub-directories
^/[^/]+/[^/]+/[^/]+/[^/]+/$
example matching URL path: /retail/clothing/jeans/low-rise-jeans-12345A/
(and so on...)
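The patterns above all repeat the same building block, [^/]+/, once per sub-directory. As a rough sketch (Python's re module rather than GA itself, and exact_depth_pattern is a hypothetical helper name, not something from GA), the pattern for any depth can be generated and checked like this:

```python
import re

def exact_depth_pattern(n: int) -> str:
    """Build the 'exactly n sub-directories, with trailing slash' regex."""
    # Each sub-directory is one or more non-slash characters followed by a slash.
    return "^/" + "[^/]+/" * n + "$"

three_deep = re.compile(exact_depth_pattern(3))

print(exact_depth_pattern(3))                               # ^/[^/]+/[^/]+/[^/]+/$
print(bool(three_deep.search("/retail/clothing/jeans/")))   # True
print(bool(three_deep.search("/retail/clothing/")))         # False
```

These simple patterns should behave the same in GA, but it's still worth confirming them in the All Pages report.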
VARIATION 2: NO TRAILING SLASH
The above works in the case that all URLs end in a trailing slash. If they don't, the regex needs to be altered as follows.
Regex for exactly one sub-directory + text
^/[^/]+/[^/]+[a-zA-Z0-9]$
example matching URL path: /retail/clothing
Regex for exactly two sub-directories + text
^/[^/]+/[^/]+/[^/]+[a-zA-Z0-9]$
example matching URL path: /retail/clothing/jeans
(and so on...)
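Here is a quick check of the "one sub-directory + text" pattern, again sketched in Python rather than in GA, with made-up test paths. One thing it surfaces: because [^/]+ and [a-zA-Z0-9] each need at least one character, the final path segment has to be at least two characters long and end in a letter or digit.

```python
import re

one_dir_plus_text = re.compile(r"^/[^/]+/[^/]+[a-zA-Z0-9]$")

test_paths = [
    "/retail/clothing",        # matches
    "/retail/clothing/",       # no match: ends in a slash
    "/retail/clothing/jeans",  # no match: one segment too many
    "/retail/c",               # no match: last segment is a single character
]

for path in test_paths:
    print(path, bool(one_dir_plus_text.search(path)))
```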
VARIATION 3: AT LEAST X NUMBER OF SUBDIRECTORIES
If you want to match AT LEAST some number of sub-directories, just remove the initial caret, so the pattern only has to match the end of the path rather than the whole thing. The regex to match at least one sub-directory would be /[^/]+/$, the regex to match at least two sub-directories would be /[^/]+/[^/]+/$, and so on.
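To see the difference the caret makes, this small sketch compares the "at least two" and "exactly two" patterns on a few example paths. It uses Python's re.search, which (like the patterns in this post assume) looks for a match anywhere in the string.

```python
import re

at_least_two = re.compile(r"/[^/]+/[^/]+/$")   # no caret: only the end is pinned
exactly_two = re.compile(r"^/[^/]+/[^/]+/$")   # caret: the whole path is pinned

for path in ["/retail/", "/retail/clothing/", "/retail/clothing/jeans/"]:
    print(path, bool(at_least_two.search(path)), bool(exactly_two.search(path)))
```

The three-level path matches only the pattern without the caret, and the one-level path matches neither.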
VARIATION 4: PATH SEGMENTS STARTING WITH A NUMBER
This is useful for the case where you have URL path segments that always start with a number. I see this a lot with product detail pages.
Regex for 1 subdirectory + path segment starting with a number
^/[^/]+/[0-9]
example matching URL path: /t-shirt/1234567-a-new-design
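As a sanity check, the sketch below tests a pattern that requires the first character of the second path segment to be a digit. The test paths other than the one from the example above are invented for illustration.

```python
import re

# One sub-directory, then a path segment whose first character is a digit.
starts_with_number = re.compile(r"^/[^/]+/[0-9]")

for path in [
    "/t-shirt/1234567-a-new-design",  # matches
    "/t-shirt/7-day-sale",            # matches: a single leading digit is enough
    "/t-shirt/a-new-design",          # no match: segment starts with a letter
]:
    print(path, bool(starts_with_number.search(path)))
```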
TEST VIA THE ALL PAGES REPORT
Before updating your Content Grouping or Goal Destination settings, test your regex condition. To do this, navigate to Behavior > Site Content > All Pages and click "advanced".
Once the advanced filter opens up, set the match type to "Matching RegExp" and enter your regex condition.
Click Apply and manually verify that the results look as expected.
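If you'd rather check a batch of page paths outside of GA first, something like the following sketch can help. It applies the e-commerce logic from the introduction (one sub-directory = main category, two = sub-category, three = product details); the group names, the content_group function, and the "(not set)" fallback label are all assumptions for illustration.

```python
import re

# Ordered list of (group name, pattern); the first matching pattern wins.
RULES = [
    ("Main category", re.compile(r"^/[^/]+/$")),
    ("Sub-category", re.compile(r"^/[^/]+/[^/]+/$")),
    ("Product details", re.compile(r"^/[^/]+/[^/]+/[^/]+/$")),
]

def content_group(path: str) -> str:
    """Return the name of the first rule whose regex matches the path."""
    for name, pattern in RULES:
        if pattern.search(path):
            return name
    return "(not set)"

for path in ["/clothing/", "/clothing/jeans/", "/clothing/jeans/low-rise-jeans-12345A/", "/"]:
    print(path, "->", content_group(path))
```

You could paste in a list of paths exported from the All Pages report and scan the output for anything that lands in the wrong group.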
Hi, thanks for writing out this article. It's been super helpful and is almost exactly what I'm looking for.
However, I'd like to take "Variation 1" one step further.
With the example, Regex for exactly one sub-directory:
^/[^/]+/$
This will match any top-level directory, for example: /retail/.
I'd like to do this, plus 2 and 3 directories deeper, but ideally I want to specify exactly what that directory is. In this example, matching only the /retail/ directory, and all subsequent subdirectories.
Would love if you'd be able to explain that!
That's great, glad to hear it was useful!
If you want to specify a specific directory, you should be able to type it in like this:
^/(retail)+/[^/]+/$ (2 subdirectories, including /retail/)
^/(retail)+/[^/]+/[^/]+/$ (3 subdirectories, including /retail/)
^/(retail)+/[^/]+/[^/]+/[^/]+/$ (4 subdirectories, including /retail/)
Please try that and let me know if there's any issue.
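A quick way to try this out is the Python sketch below (test paths are made up). It also compares the pattern against a version with the literal directory name typed in directly: as far as I can tell the two agree on real /retail/ URLs, and differ only in that (retail)+ would also accept a repeated name such as /retailretail/.

```python
import re

with_group = re.compile(r"^/(retail)+/[^/]+/$")   # pattern as suggested above
literal_only = re.compile(r"^/retail/[^/]+/$")    # same idea, name typed in directly

for path in ["/retail/clothing/", "/wholesale/clothing/", "/retailretail/clothing/"]:
    print(path, bool(with_group.search(path)), bool(literal_only.search(path)))
```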
Hi there! Thanks so much for this article. It's incredibly helpful. I'm trying the above without luck. In my example there is an underscore in the first subdirectory I'd like to group the content by.
Example:
Goal: group all content containing two sub directories after /retail_store/, beginning with /retail_store/
^/(retail_store)+/[^/]+/$
Any idea what I might be doing wrong?
Thank you for your comment, much appreciated!
If you have 2 subdirectories after /retail_store/, you'll have a total of 3 subdirectories. So for that case you should use the following:
^/(retail_store)+/[^/]+/[^/]+/$
It shouldn't matter if there's an underscore or not. Please try it out and let me know how it goes!
Thanks so much, Ana! I'm still having some issues with this.
For context, I'm trying to group content in GA using this regex. I want to group views to two different sub-directories:
1. https://website.com/retail_store/123
2. http://website.com/retail_store/123/456
Pages with exactly two subdirectories starting with retail_store is one category, and pages with exactly three subdirectories starting with retail_store is the second category. Any other advice you might have would be awesome - I so appreciate your help!
I think the problem is that your URLs don't have trailing slashes, so it's actually either 1 subdirectory or 2 subdirectories plus a path segment. Can you try the following?
1. https://website.com/retail_store/123 ==> ^/(retail_store)+/[^/]+[a-zA-Z0-9]$
2. http://website.com/retail_store/123/456 ==> ^/(retail_store)+/[^/]+/[^/]+[a-zA-Z0-9]$
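To illustrate why the trailing slash matters here, this sketch pulls the path out of each example URL with Python's urllib.parse (GA's Page dimension holds the path, not the full URL) and tests it against both the no-trailing-slash patterns and the earlier trailing-slash one. The variable names are just for illustration.

```python
import re
from urllib.parse import urlparse

two_segments = re.compile(r"^/(retail_store)+/[^/]+[a-zA-Z0-9]$")
three_segments = re.compile(r"^/(retail_store)+/[^/]+/[^/]+[a-zA-Z0-9]$")
with_trailing_slash = re.compile(r"^/(retail_store)+/[^/]+/[^/]+/$")

for url in ["https://website.com/retail_store/123", "http://website.com/retail_store/123/456"]:
    path = urlparse(url).path  # e.g. "/retail_store/123"
    print(
        path,
        bool(two_segments.search(path)),
        bool(three_segments.search(path)),
        bool(with_trailing_slash.search(path)),
    )
```

Neither path matches the trailing-slash pattern, which is consistent with the earlier attempts not working.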
Hey there! Just realized I never received the notification for this. Your suggestion works! I'll do some reading up around trailing slashes to make sure I don't run into this issue again in the future. 🙂
Super helpful post! Do you also have something in mind for filtering a Landing Page URL based on the number of certain symbols? E.g. filter out URLs that contain 3 or more "_"?
I think you can use the following regex to capture URLs with 3 or more underscores:
\_.*\_.*\_
Please check it out and let me know if that works for you!
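Here is a small sketch of the same idea in Python, with invented test paths. The underscore isn't a special character in regex, so this version writes it without the backslashes; the logic is unchanged: an underscore, anything, another underscore, anything, and a third underscore.

```python
import re

three_or_more_underscores = re.compile(r"_.*_.*_")

for path in ["/summer_sale_2020_new_items", "/a_b_c_d", "/a_b_c", "/no-underscores-here"]:
    print(path, bool(three_or_more_underscores.search(path)))
```

Only the first two paths, which contain at least three underscores, match.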
Thank you so much Ana.
Can you help me with this? My website has the following structure: http://www.example.com/abc/asddf, http://www.example.com/def/gfgs, http://www.example.com/xyz/asds, etc., along with other multiple sub-folders like http://www.example.com/pqr/abc/faasd.
I want to extract all URLs that contain /abc/, /def/ and /xyz/ exactly after the root folder. What regex should I be using?
Thanks for help.
Hey Himanshu! Thank you for the comment.
In this case you should be able to make a regex like this:
/abc/|/def/|/xyz/
(the pipe character means OR).
Does that work for you?
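One caveat worth checking: without a caret, the expression matches those folders anywhere in the path, so a URL like /pqr/abc/faasd would be included too. If the folders need to come directly after the root, an anchored variant is one option. The sketch below compares the two in Python; the anchored pattern is my own suggestion and the variable names are illustrative.

```python
import re

anywhere = re.compile(r"/abc/|/def/|/xyz/")   # folder can appear at any depth
at_root = re.compile(r"^/(abc|def|xyz)/")     # folder must be the first segment

for path in ["/abc/asddf", "/def/gfgs", "/xyz/asds", "/pqr/abc/faasd"]:
    print(path, bool(anywhere.search(path)), bool(at_root.search(path)))
```

The last path matches only the unanchored pattern.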
It worked! Thanks a lot!
Hi,
Thanks a lot for this very useful article. I am still struggling to properly group the pages of my website, though.
http://www.mywebsite.com/20301/landingpage
http://www.mywebsite.com/20301/usermanagement
I would like to group by the last subdirectory, in this example landingpage and usermanagement, ignoring the 20301 which is the customer id.
Any suggestion on how to do that?
Thanks a lot in advance,
Roberto
Hi Roberto, if you just want to group by those specific pages in GA I don't think you need regex at all, you can just make conditions like Page contains /landingpage and Page contains /usermanagement. Let me know if I've misunderstood what you're looking for.
Hi Ana,
Thanks a lot for your feedback. I tried to follow your suggestion but it still doesn't work.
What could I do?
Hey Roberto, do you have any more detail about what you're looking for?
Hi Ana - great article! Question.
Is there a way to grab the beginning and end by required URL string, but disregard the amount of subdirectory paths?
Example: website.com/abc/d-e/123 and website.com/abc/123
I want to grab the URL string if both /abc/ and /123 are True, regardless of how many subdirectory paths there are. Does this make sense? Thanks in advance for any direction you can provide!
Hey AF, sure, you can use .* as a wildcard.
So if you want to match on /abc/ and /123, you'd use the following expression: /abc/.*123
(I left the slash off the 123 so the regex will work even if it's immediately following /abc/. You can test it in your All Pages report to make sure it pulls in the URLs you're looking for.)
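For a quick check outside GA, here is the wildcard expression run against the two example paths plus one that should not match (a made-up path). Keep in mind that .* is greedy about what it allows in between, so anything after /abc/ that contains 123 will match.

```python
import re

abc_then_123 = re.compile(r"/abc/.*123")

for path in ["/abc/d-e/123", "/abc/123", "/xyz/123"]:
    print(path, bool(abc_then_123.search(path)))
```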
This worked - and I'm kicking myself for not seeing it. Thank you!!
Hi Ana. And here is the regex to match no directories: http://www.mywebsite.com/page.aspx
^\/[^\/]+\.aspx
Thank you for sharing! Though, your regex will only work for the specific case where the URL ends with some known string (like your '.aspx' example). If you want something a little more generalizable I think you could use this to match no directories: ^/[^/]+[a-zA-Z0-9]$
I am trying to set up goals in GA and want to track pages in the following way.
Suppose I have http://www.abc/Category/Subcategory/Products1
http://www.abc/Category/Subcategory/Products2
http://www.abc/Category/Subcategory-2/Product-1
http://www.abc/Category/Subcategory-2/Product-2
I want to track only Category, Subcategory, and Products pages.
I want to track Home Page --> Category Pages --> Subcategory Pages --> Products, configured as goals in GA. I need to know how to set up the destination URL so that I can include multiple categories while excluding subcategories, then multiple subcategories while excluding products or categories, and likewise for products.
I'm not sure I totally understand your question, but GA doesn't support negative lookahead regex, so in general you need to choose simple conditions that include the text you want, and exclude the text you don't. So in your example, a regex condition of "/category/" would naturally exclude "/subcategory-2/" since the string doesn't include "-2" in it.
Thanks a bunch - exactly what i needed.
Thank you so much for this article! I had a few headaches getting my head around how to do this properly, but thanks to your examples I've managed to get a clean grouping in Google Analytics.
However... the headaches are starting again now that I want to use it in Data Studio. I've changed the regex to match Data Studio's regex flavor (Google RE2). However, the regex doesn't exclude URL strings with more than 3 directories. It now works as a minimum of 3 directories. I'd like to exclude URL strings with 4 or more directories, so I only have the URL strings with 3 directories.
Do you happen to have an idea how to do this in Google RE2 for Datastudio?
Thanks for your help! 🙂
Turns out I made a mistake. It works fine, probably stared at this too long at once. Even more thanks now!
Hello Ana,
This is really helpful.
Thanks a lot
Ivan