URI && URL
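A URI (Uniform Resource Identifier) is defined in RFC 3986. A URL (Uniform Resource Locator) is a special form of URI that also gives the network location of the resource. The original URL standard, RFC 1738, is now obsolete; URLs, domains, IP addresses, and the application/x-www-form-urlencoded format are defined today in the WHATWG URL Standard.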
encode && decode URI
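Encoding a URI means percent-encoding every character that falls outside the reserved and unreserved sets defined by the RFC, so that parsing does not run into ambiguity or errors. Encoding a single component goes one step further and also escapes reserved characters that would otherwise be read as delimiters. For example, a user might type Thyme &time=again as part of a comment variable. If that content is not escaped with encodeURIComponent, the server receives comment=Thyme%20&time=again. Note that the & and = characters start a new key-value pair, so the server sees two pairs (comment=Thyme and time=again) instead of one.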
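The effect described above can be checked with a small sketch. It assumes a runtime that provides the standard URLSearchParams global (browsers, Node.js), which is used here only as a stand-in for a server-side form parser; the variable names are illustrative.

```javascript
// The value a user typed into the "comment" field.
const comment = "Thyme &time=again";

// Query built without escaping vs. with encodeURIComponent.
const unescapedQuery = "comment=" + comment;
const escapedQuery = "comment=" + encodeURIComponent(comment);

// URLSearchParams splits on & and =, much like a form-decoding server would.
const unescapedPairs = [...new URLSearchParams(unescapedQuery).entries()];
const escapedPairs = [...new URLSearchParams(escapedQuery).entries()];

console.log(escapedQuery);          // comment=Thyme%20%26time%3Dagain
console.log(unescapedPairs.length); // 2 (comment and time)
console.log(escapedPairs.length);   // 1 (comment only)
```

Only the escaped query round-trips the original text as a single comment value.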
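Encoding and decoding are normally used as a matched pair. For example, the body of a form POST is sent to the back end encoded as application/x-www-form-urlencoded; it can be decoded with PHP's urldecode, or in Java with java.net.URLDecoder.decode(String s, String enc). If the front end instead encodes the data with encodeURIComponent before an Ajax POST, the back end should decode it with PHP's rawurldecode, or in Java with Spring's utility method org.springframework.web.util.UriUtils.decode(String source, String encoding).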
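Why the pairing matters can be sketched in JavaScript alone, without PHP or Java. The sample string and the field name q are assumptions made for illustration; URLSearchParams stands in for a form-style codec and decodeURIComponent for a percent-only decoder.

```javascript
const original = "a b+c";

// Percent-only encoding: space becomes %20, a literal plus becomes %2B.
const percentEncoded = encodeURIComponent(original);

// Form-style encoding: space becomes +, a literal plus becomes %2B.
const formEncoded = new URLSearchParams({ q: original }).toString();

// Matched pairs round-trip correctly.
const percentRoundTrip = decodeURIComponent(percentEncoded);
const formRoundTrip = new URLSearchParams(formEncoded).get("q");

// Mismatched pair: a percent-only decoder leaves the form-style "+" untouched.
const mismatched = decodeURIComponent(formEncoded.slice("q=".length));

console.log(percentEncoded); // a%20b%2Bc
console.log(formEncoded);    // q=a+b%2Bc
console.log(mismatched);     // a+b+c (not the original string)
```

The same mismatch is what happens when a form-style decoder is applied to data produced by encodeURIComponent, or the other way around.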
application/x-www-form-urlencoded encoding rules
5.2. application/x-www-form-urlencoded serializing

The application/x-www-form-urlencoded byte serializer takes a byte sequence input and then runs these steps:

1. Let output be the empty string.
2. For each byte in input, depending on byte:
   0x20 (SP)
      Append U+002B (+) to output.
   0x2A (*)
   0x2D (-)
   0x2E (.)
   0x30 (0) to 0x39 (9)
   0x41 (A) to 0x5A (Z)
   0x5F (_)
   0x61 (a) to 0x7A (z)
      Append a code point whose value is byte to output.
   Otherwise
      Append byte, percent encoded, to output.
3. Return output.
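As the application/x-www-form-urlencoded rules above show, alphanumerics and the characters *, -, ., and _ are left unencoded, a space is encoded as +, and every other character is percent-encoded. This is why a space in a Google search query shows up in the URL as + rather than %20.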
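The byte serializer quoted above can be sketched as follows. The function name formUrlencodeBytes is hypothetical, and the use of TextEncoder to turn a string into UTF-8 bytes is an assumption about the caller; the quoted steps themselves only deal with bytes.

```javascript
// Sketch of the application/x-www-form-urlencoded byte serializer.
function formUrlencodeBytes(bytes) {
  let output = "";
  for (const byte of bytes) {
    const passesThrough =
      byte === 0x2a || byte === 0x2d || byte === 0x2e || // * - .
      (byte >= 0x30 && byte <= 0x39) ||                  // 0-9
      (byte >= 0x41 && byte <= 0x5a) ||                  // A-Z
      byte === 0x5f ||                                   // _
      (byte >= 0x61 && byte <= 0x7a);                    // a-z
    if (byte === 0x20) {
      output += "+"; // space becomes U+002B
    } else if (passesThrough) {
      output += String.fromCharCode(byte);
    } else {
      // Everything else is percent-encoded as two uppercase hex digits.
      output += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return output;
}

const sampleBytes = new TextEncoder().encode("a b*c~d&é");
const formEncodedSample = formUrlencodeBytes(sampleBytes);
console.log(formEncodedSample); // a+b*c%7Ed%26%C3%A9
```

Note that ~ is percent-encoded here even though RFC 3986 treats it as unreserved, which is one of the visible differences between the two schemes.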
Character definitions in RFC 3986
RFC 3986                   URI Generic Syntax               January 2005

2.1. Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value. For example, "%20" is the percent-encoding for the binary octet "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space character (SP). Section 2.4 describes when percent-encoding and decoding is applied.

      pct-encoded = "%" HEXDIG HEXDIG

   The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f', respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.

2.2. Reserved Characters

   URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

2.3. Unreserved Characters

   Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
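JavaScript's encodeURIComponent leaves !, ', (, ) and * unescaped, although RFC 3986 lists them among the sub-delims. A minimal sketch of a stricter component encoder, assuming the goal is to escape everything outside the RFC 3986 unreserved set, could look like this (encodeRfc3986Component is a hypothetical helper name, not a built-in):

```javascript
// Escape the five sub-delims that encodeURIComponent leaves alone,
// using uppercase hex digits as RFC 3986 recommends.
function encodeRfc3986Component(text) {
  return encodeURIComponent(text).replace(
    /[!'()*]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

const strictSample = encodeRfc3986Component("a!b'c(d)e*f~g");
console.log(strictSample); // a%21b%27c%28d%29e%2Af~g
```

The unreserved characters (letters, digits, -, ., _, ~) pass through unchanged, so decodeURIComponent still reverses the result.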
Selected character definitions in RFC 2396
JavaScript's encodeURI function does not escape any character in the reserved and unreserved sets defined by RFC 2396 (it also leaves # unescaped).
2.2. Reserved Characters

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

2.3. Unreserved Characters

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
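The behaviour can be checked directly against the two RFC 2396 sets quoted above. The sample URL is made up for illustration.

```javascript
const rfc2396Reserved = ";/?:@&=+$,";
const rfc2396Marks = "-_.!~*'()";

// encodeURI leaves both sets untouched.
console.log(encodeURI(rfc2396Reserved) === rfc2396Reserved); // true
console.log(encodeURI(rfc2396Marks) === rfc2396Marks);       // true

// encodeURIComponent keeps the marks but escapes every reserved character.
const reservedAsComponent = encodeURIComponent(rfc2396Reserved);
console.log(reservedAsComponent); // %3B%2F%3F%3A%40%26%3D%2B%24%2C

// On a whole URL, encodeURI only touches characters outside both sets.
const encodedUrl = encodeURI("http://example.com/a b?x=1&y=é");
console.log(encodedUrl); // http://example.com/a%20b?x=1&y=%C3%A9
```

This is why encodeURI suits a complete URL, while encodeURIComponent suits a single value placed inside one.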
Selected character definitions in RFC 1738
RFC 1738            Uniform Resource Locators (URL)        December 1994

      safe        = "$" | "-" | "_" | "." | "+"
      extra       = "!" | "*" | "'" | "(" | ")" | ","
      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "="
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                    "a" | "b" | "c" | "d" | "e" | "f"
      escape      = "%" hex hex
      unreserved  = alpha | digit | safe | extra