Regex to Match Number of Subdirectories in a URL

Using regex to match a specific number of sub-directories in a URL can be very helpful for Google Analytics. When I configure a new Google Analytics view, I’ll usually set up Content Grouping so we can see traffic by page type rather than just by individual page. Ideally there’s a value in the data layer that we can use for this purpose; failing that, I look for certain keywords in the URL in order to use GA’s “Group using rule definitions” functionality. For example, if all the blog pages are grouped into a sub-directory called /blog/, it’s easy enough to add a rule definition like “Page contains /blog/”. This also applies to GA’s URL destination goal setup, which also accepts string matches.

Unfortunately, real-life scenarios are often not that clean. There are many cases where there’s neither a data layer value nor specific keywords in the URL. In those cases there’s another potential approach: count up the number of sub-directories and match on those with a regular expression (regex). For example, an e-commerce site may have URLs like www.site.com/clothing/jeans/low-rise-jeans-12345A/. In that case you could use logic like: three sub-directories = product details page, two sub-directories = sub-category page, one sub-directory = main category page.
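To make the counting idea concrete, here is a minimal sketch in Python rather than in GA itself. It assumes the input is the URL path only (no hostname), which is what GA's Page dimension normally holds, and that Python's `re` module treats these basic constructs the same way GA does. The `RULES` list, the labels, and the `page_type` helper are hypothetical names used only for illustration.

```python
import re

# Rule definitions in the order they would be checked; the labels are illustrative.
RULES = [
    ("main category", re.compile(r"^/[^/]+/$")),
    ("sub-category", re.compile(r"^/[^/]+/[^/]+/$")),
    ("product details", re.compile(r"^/[^/]+/[^/]+/[^/]+/$")),
]

def page_type(path: str) -> str:
    """Return the label of the first rule whose regex matches the URL path."""
    for label, pattern in RULES:
        if pattern.search(path):
            return label
    return "other"

print(page_type("/clothing/jeans/low-rise-jeans-12345A/"))
```

Each pattern is anchored at both ends, so a path can satisfy at most one rule and the order of the rules does not affect the result.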

This post will provide the regex for matching specific numbers of sub-directories in a URL path, for a few different cases.

VARIATION 1: EXACTLY X NUMBERS OF SUB-DIRECTORIES, WITH TRAILING SLASH
VARIATION 2: NO TRAILING SLASH
VARIATION 3: AT LEAST X NUMBER OF SUBDIRECTORIES
VARIATION 4: PATH SEGMENTS STARTING WITH A NUMBER

VARIATION 1: EXACTLY X NUMBERS OF SUB-DIRECTORIES, WITH TRAILING SLASH

This variation assumes every URL path ends in a trailing slash, so each sub-directory is followed by a slash.

Regex for exactly one sub-directory

^/[^/]+/$

example matching URL path: /retail/

Regex for exactly two sub-directories

^/[^/]+/[^/]+/$

example matching URL path: /retail/clothing/

Regex for exactly three sub-directories

^/[^/]+/[^/]+/[^/]+/$

example matching URL path: /retail/clothing/jeans/

Regex for exactly four sub-directories

^/[^/]+/[^/]+/[^/]+/[^/]+/$

example matching URL path: /retail/clothing/jeans/low-rise-jeans-12345A/

(and so on…)
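Since each extra sub-directory just adds one more `[^/]+/` to the expression, the "and so on" can be sketched as a small Python helper. The function name `exact_depth_pattern` is hypothetical, and Python's `re` module is assumed here only as a convenient way to check the result outside of GA.

```python
import re

def exact_depth_pattern(n: int) -> str:
    """Build the regex for exactly n sub-directories, each followed by a slash."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One leading slash, then n repetitions of "one or more non-slash characters + slash".
    return "^/" + "[^/]+/" * n + "$"

pattern = exact_depth_pattern(4)
print(pattern)
print(bool(re.search(pattern, "/retail/clothing/jeans/low-rise-jeans-12345A/")))
```

The generated string can be pasted into a GA rule definition as-is.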

 

VARIATION 2: NO TRAILING SLASH

The above works in the case that all URLs end in a trailing slash. If they don’t, the regex needs to be altered as follows.

Regex for exactly one sub-directory + text

^/[^/]+/[^/]+[a-zA-Z0-9]$

example matching URL path: /retail/clothing

Regex for exactly two sub-directories + text

^/[^/]+/[^/]+/[^/]+[a-zA-Z0-9]$

example matching URL path: /retail/clothing/jeans

(and so on…)
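The sketch below checks the "one sub-directory + text" expression against a few sample paths in Python (an assumption made for convenience; the sample paths are made up). One detail worth knowing: because `[^/]+` is followed by `[a-zA-Z0-9]`, the final segment must be at least two characters long and end in a letter or digit.

```python
import re

ONE_PLUS_TEXT = re.compile(r"^/[^/]+/[^/]+[a-zA-Z0-9]$")

samples = [
    "/retail/clothing",        # one sub-directory + text, no trailing slash
    "/retail/clothing/",       # trailing slash
    "/retail/clothing/jeans",  # one level too deep
    "/retail/c",               # final segment is a single character
]

# Map each sample path to whether the expression matches it.
results = {path: bool(ONE_PLUS_TEXT.search(path)) for path in samples}
for path, matched in results.items():
    print(path, matched)
```

Only the first sample matches. If single-character final segments occur on your site, this expression will skip them.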

 

VARIATION 3: AT LEAST X NUMBER OF SUBDIRECTORIES

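If you want to match AT LEAST some number of sub-directories, just remove the initial caret. Without the caret the expression is only anchored to the end of the path, so any number of additional sub-directories can come before the part that matches. So the regex to match at least one sub-directory would be /[^/]+/$, the regex to match at least two sub-directories would be /[^/]+/[^/]+/$, etc.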
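Here is a short Python sketch of the "at least" variation, assuming paths end in a trailing slash as in Variation 1. The helper name `at_least_depth_pattern` is hypothetical. Note the use of `re.search` rather than `re.match`, since the expression is no longer anchored to the start of the path.

```python
import re

def at_least_depth_pattern(n: int) -> str:
    """Build the regex for n or more sub-directories (no leading caret)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return "/" + "[^/]+/" * n + "$"

at_least_two = at_least_depth_pattern(2)
print(at_least_two)
print(bool(re.search(at_least_two, "/retail/clothing/jeans/")))  # three levels deep
print(bool(re.search(at_least_two, "/retail/")))                 # only one level deep
```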

 

VARIATION 4: PATH SEGMENTS STARTING WITH A NUMBER

This is useful for the case where you have URL path segments that always start with a number. I see this a lot with product detail pages.

Regex for 1 subdirectory + path segment starting with a number

^/[^/]+/[0-9]

example matching URL path: /t-shirt/1234567-a-new-design
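As a quick check of this variation, the Python sketch below uses `^/[^/]+/[0-9]`, which requires the first character of the second path segment to be a digit. The sample paths other than the one above are made up for illustration.

```python
import re

STARTS_WITH_DIGIT = re.compile(r"^/[^/]+/[0-9]")

samples = [
    "/t-shirt/1234567-a-new-design",  # second segment starts with a digit
    "/t-shirt/1-design",              # a single leading digit is enough
    "/t-shirt/a-new-design",          # starts with a letter
    "/t-shirt/a1-design",             # digit is present, but not first
]

# Map each sample path to whether the expression matches it.
results = {path: bool(STARTS_WITH_DIGIT.search(path)) for path in samples}
for path, matched in results.items():
    print(path, matched)
```

There is no `$` at the end of the expression, so it does not restrict what follows the leading digit, including further sub-directories.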

TEST VIA THE ALL PAGES REPORT

Before updating your Content Grouping or Goal Destination settings, test your regex condition. To do this, navigate to Behavior > Site Content > All Pages and click “advanced”.

[Screenshot: subdirectory regex, image 1]

Once the advanced filter opens up, enter your regex condition like this:

[Screenshot: subdirectory regex, image 2]

Click Apply and manually verify that the results look as expected.

Comments:19

  1. Hi, thanks for writing out this article. It’s been super helpful and is almost exactly what I’m looking for.

    However, I’d like to take “Variation 1” one step further.

    With the example, Regex for exactly one sub-directory:
    ^/[^/]+/$

    This will match any top-level directory, for example: /retail/.

    I’d like to do this, plus 2 and 3 directories deeper, but ideally I want to specify exactly what that directory is. In this example, matching only the /retail/ directory, and all subsequent subdirectories.

    Would love if you’d be able to explain that!

    1. That’s great, glad to hear it was useful!

      If you want to specify a specific directory, you should be able to type it in like this:

      ^/(retail)+/[^/]+/$ (2 subdirectories, including /retail/)
      ^/(retail)+/[^/]+/[^/]+/$ (3 subdirectories, including /retail/)
      ^/(retail)+/[^/]+/[^/]+/[^/]+/$ (4 subdirectories, including /retail/)

      Please try that and let me know if there’s any issue.

      1. Hi there! Thanks so much for this article. It’s incredibly helpful. I’m trying the above without luck. In my example there is an underscore in the first subdirectory I’d like to group the content by.
        Example:
        Goal: group all content containing two sub directories after /retail_store/, beginning with /retail_store/
        ^/(retail_store)+/[^/]+/$

        Any idea what I might be doing wrong?

      2. Thank you for your comment, much appreciated!

        If you have 2 subdirectories after /retail_store/, you’ll have a total of 3 subdirectories. So for that case you should use the following:
        ^/(retail_store)+/[^/]+/[^/]+/$

        It shouldn’t matter if there’s an underscore or not. Please try it out and let me know how it goes!

  2. Thanks so much, Ana! I’m still having some issues with this.

    For context, I’m trying to group content in GA using this regex. I want to group views to two different sub-directories:
    1. https://website.com/retail_store/123
    2. http://website.com/retail_store/123/456
    Pages with exactly two subdirectories starting with retail_store is one category, and pages with exactly three subdirectories starting with retail_store is the second category. Any other advice you might have would be awesome – I so appreciate your help!

      1. Hey there! Just realized I never received the notification for this. Your suggestion works! I’ll do some reading up around trailing slashes to make sure I don’t run into this issue again in the future. 🙂

  3. Super helpful post! Do you also have something in mind for filtering a Landing Page URL based on the number of certain symbols? E.g. filter out URLs that contain 3 or more “_”?

    1. I think you can use the following regex to capture URLs with 3 or more underscores:
      \_.*\_.*\_
      Please check it out and let me know if that works for you!
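For anyone who wants to try the underscore expression outside of GA first, here is a minimal Python sketch. It writes the expression without the backslashes, which should be equivalent since an underscore has no special meaning in a regex. The sample paths are made up.

```python
import re

# Three underscores with anything (or nothing) in between.
THREE_UNDERSCORES = re.compile(r"_.*_.*_")

samples = ["/landing_page_one_two", "/landing_page_one", "/a___b", "/no-underscores"]

# Map each sample path to whether it contains at least three underscores.
results = {path: bool(THREE_UNDERSCORES.search(path)) for path in samples}
for path, matched in results.items():
    print(path, matched)
```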

    1. Hey Himanshu! Thank you for the comment.

      In this case you should be able to make a regex like this:
      /abc/|/def/|/xyz/

      (the pipe character means OR).

      Does that work for you?

    1. Hi Roberto, if you just want to group by those specific pages in GA I don’t think you need regex at all, you can just make conditions like Page contains /landingpage and Page contains /usermanagement. Let me know if I’ve misunderstood what you’re looking for.

      1. Hi Ana,
        Thanks a lot for your feedback. I tried to follow your suggestion but it doesn’t work anyway.

        What could I do?

  4. Hi Ana – great article! Question.

    Is there a way to grab the beginning and end by required URL string, but disregard the amount of subdirectory paths?

    Example: website.com/abc/d-e/123 and website.com/abc/123

    I want to grab the URL string if both /abc/ and /123 are True, regardless of how many subdirectory paths there are. Does this make sense? Thanks in advance for any direction you can provide!

    1. Hey AF, sure, you can use .* as a wildcard.

      So if you want to match on /abc/ and /123, you’d use the following expression: /abc/.*123

      (I left the slash off the 123 so the regex will work even if it’s immediately following /abc/. You can test it in your All Pages report to make sure it pulls in the URLs you’re looking for.)
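A quick way to sanity-check the wildcard expression before trying it in GA is a small Python sketch like the one below. The first two paths come from the question; the others are made up for illustration.

```python
import re

ABC_THEN_123 = re.compile(r"/abc/.*123")

samples = ["/abc/d-e/123", "/abc/123", "/xyz/123", "/abc/d-e/1234"]

# Map each sample path to whether the expression matches it.
results = {path: bool(ABC_THEN_123.search(path)) for path in samples}
for path, matched in results.items():
    print(path, matched)
```

The last sample also matches, because nothing in the expression requires the path to end right after 123. Adding `$` at the end would be one way to tighten that, if needed.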
