Updated URL regex for hashtags to disregard hashes used mid-sentence. #2361
Summary
Closes #2361
Simply, users would have one of their 5 hashtags used if posting a url with a hash in the query string, for example the invalid URL https://www.minds.com/#minds
The changed regex matches when:
- The first character is a space, or the start of a line.
- Followed by a hash.
- Followed atleast one word character
Previously, there was no whitespace check. As the match returns an array, this change would mess with the indexes of matches, which we use like url=${match[1]}&query=${match[2]}
, by incrementing the place of the value we want.
Rather than going through and changing the index to use match[match.length]
for every occurrence, I used ?:
so that the first group is not returned in the match array.
#Lorem#ipsum #dolor.sit amet, #consectetur adipiscing e#lit, https://loremipsum.io/#123#sed #do #eiusmod #tempo1=incididunt #132323 #@ #w #[ #] ' #word \ #labore@et dolore #magna aliqua. #Ut #enim #ad .minim #veniam, ^ #quis
#nostrud exercitation ullamco #laboris nisi
ut #aliquip ex ea commodo#consequat. Duis ()'#aute sad'#irure dolor
#in reprehen;'
#derit in@#voluptate velit essehttp: ci#https://loremipsum.io/#123llum dolore https://loremipsum.io/#123 https://loremipsum.io/#eu https://loremipsum.io/#fugiat https://loremipsum.io/#123 nulla pariatur. #Excepteur #sint #occaecat cupidatat
#non
#proident,
a #sunt in culpa qui officia deserunt mollit anim id est laborum.
Steps to test
This one is a bit more "get in there and try your best to break it" in terms of testing, but that said, to get a foot in the door:
- Go on the site, make a post with 5 hashtags; try to trip the Regex up and make some invalid tags to see how it responds e.g. hello#world.
- Try a URL with a hashtag in.
- Check the hashtags links go through to the correct place.
- Try to use the hashtag selector to bypass the limit
Estimated Regression Scope
The only thing making me pause to question is the old Regex goes like this/: /(^|\s||)#(\w+)/gim,
. It could be the source of a bug, or it could be a Regex trick I'm missing.
Take the first group (^|\s||)
- to me that looks like a bug. I'm not aware of any double OR operand in regex nor can I find a reference for it.
This appears to stop the \s
from being matched, and I presume also the ^
. Thus on production #hello#world counts as 2 hashtags.
This will change occurrences of that on production. #hello#world will match the hashtag #hello, and ignore the #world.
I think this is correct behavior, but changing it needs discussion, because it may invalidate people's tags in posts that for whatever reason, like to #combine#their#tags#like#this
added 2h of time spent at 2020-01-29
- Developer
I want a review on the regex before I fix the spec tests up here because there's consideration to be had.
- Owner
We need spec tests to verify why this change is being made.
closed