Given the URL (single line):
http://test.example.com/dir/subdir/file.html
How can I extract the following parts using regular expressions:
The regex should work correctly even if I enter the following URL:
http://example.example.com/example/example/example.html
A single regex to parse and breakup a full URL including query parameters and anchors e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
RexEx positions:
url: RegExp['$&'],
protocol:RegExp.$2,
host:RegExp.$3,
path:RegExp.$4,
file:RegExp.$6,
query:RegExp.$7,
hash:RegExp.$8
you could then further parse the host ('.' delimited) quite easily.
What I would do is use something like this:
/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
the further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.
The link codesnippets.joyent.com/posts/show/523 does not work as of Oct 20 '10
The problem is this part:
(.*)?
Since the Kleene star already accepts 0 or more, the?
part (0 or 1) is confusing it. I fixed it by changing(.*)?
to(.+)?
. You could also just remove the?
Hi Dve, I've improved it a little more to extract example.com from urls like
http://www.example.com:8080/....
Here goes:^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$
and proof that no regexp is perfect, here's one immediate correction:
^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?)(:\d+)?($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$
I modified this regex to identify all parts of the URL (improved version) - code in Python
^((?P<scheme>[^:/?#]+):(?=//))?(//)?(((?P<login>[^:]+)(?::(?P<password>[^@]+)?)?@)?(?P<host>[^@/?#:]*)(?::(?P<port>\d+)?)?)?(?P<path>[^?#]*)(\?(?P<query>[^#]*))?(#(?P<fragment>.*))?
code
You show this code in action on pythex.org