Percent-encoding in RFC 1738, RFC 2396, and RFC 3986

URI && URL

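A URI (Uniform Resource Identifier) is defined in RFC 3986. A URL (Uniform Resource Locator) is a special form of URI that gives the network location of a resource. RFC 1738, the original URL standard, is now obsolete; it defined the URL syntax, including host names and IP addresses. The application/x-www-form-urlencoded format discussed later comes from HTML forms rather than from RFC 1738.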

encode && decode URI

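Encoding a URI means percent-encoding every character that falls outside the reserved and unreserved sets defined by the RFCs, so that parsing does not produce ambiguity or errors. For example, a user might type Thyme &time=again as part of a comment variable. If that input is not escaped with encodeURIComponent, the server receives comment=Thyme%20&time=again. Note that the & and = characters start a new key/value pair, so the server sees two pairs (comment=Thyme and time=again) instead of one.

Encoding and decoding are normally used as a matched pair. For example, a form POST is submitted to the back end as application/x-www-form-urlencoded, which can be decoded with PHP's urldecode or, in Java, with java.net.URLDecoder.decode(String s, String enc). If the front end encodes with encodeURIComponent and sends the data by Ajax POST, decode it with PHP's rawurldecode or, in Java, with Spring's utility method org.springframework.web.util.UriUtils.decode(String source, String encoding).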
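The effect can be reproduced in a few lines of JavaScript. This is only a sketch: URLSearchParams stands in for whatever parser the server actually uses, and the variable names are made up for illustration.

```javascript
// The user's raw input for the "comment" field (taken from the example above).
const comment = "Thyme &time=again";

// Naive concatenation: "&" and "=" inside the value act as delimiters.
const unescapedQuery = "comment=" + comment;

// Escaping the value first keeps it inside a single key/value pair.
const escapedQuery = "comment=" + encodeURIComponent(comment);

// URLSearchParams is an assumed stand-in for the server-side parser.
const parsedUnescaped = Object.fromEntries(new URLSearchParams(unescapedQuery));
const parsedEscaped = Object.fromEntries(new URLSearchParams(escapedQuery));

console.log(escapedQuery);    // comment=Thyme%20%26time%3Dagain
console.log(parsedUnescaped); // { comment: 'Thyme ', time: 'again' }
console.log(parsedEscaped);   // { comment: 'Thyme &time=again' }
```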

application/x-www-form-urlencoded encoding rules

5.2. application/x-www-form-urlencoded serializing

The application/x-www-form-urlencoded byte serializer takes a byte sequence input and then runs these steps:

  Let output be the empty string.

  For each byte in input, depending on byte:

    0x20 (SP)
      Append U+002B (+) to output.

    0x2A (*)
    0x2D (-)
    0x2E (.)
    0x30 (0) to 0x39 (9)
    0x41 (A) to 0x5A (Z)
    0x5F (_)
    0x61 (a) to 0x7A (z)
      Append a code point whose value is byte to output.

    Otherwise
      Append byte, percent encoded, to output.

  Return output.

Under these application/x-www-form-urlencoded rules, alphanumeric characters and * - . _ are left unencoded, a space is encoded as +, and every other character is percent-encoded. This is why a space in a Google search appears in the URL as + rather than %20.
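The serializer steps quoted above are short enough to transcribe directly. The following is a minimal sketch, not a library API; the function name formUrlencodeBytes is hypothetical, and the input is assumed to be UTF-8 bytes produced by TextEncoder.

```javascript
// Hypothetical helper that follows the quoted byte-serializer steps one by one.
function formUrlencodeBytes(bytes) {
  let output = "";
  for (const byte of bytes) {
    if (byte === 0x20) {
      output += "+"; // 0x20 (SP) becomes U+002B (+)
    } else if (
      byte === 0x2a || byte === 0x2d || byte === 0x2e || // * - .
      (byte >= 0x30 && byte <= 0x39) ||                  // 0-9
      (byte >= 0x41 && byte <= 0x5a) ||                  // A-Z
      byte === 0x5f ||                                   // _
      (byte >= 0x61 && byte <= 0x7a)                     // a-z
    ) {
      output += String.fromCharCode(byte); // appended unchanged
    } else {
      // Everything else is percent-encoded as two uppercase hex digits.
      output += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return output;
}

// Assumption: the text is converted to bytes as UTF-8.
const formEncoded = formUrlencodeBytes(new TextEncoder().encode("a b*c~d+é"));
console.log(formEncoded); // a+b*c%7Ed%2B%C3%A9
```

Note that ~ is percent-encoded here, unlike in encodeURIComponent, and * is not.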

Character definitions in RFC 3986

RFC 3986                   URI Generic Syntax               January 2005

2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.  A percent-encoded octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing that octet's numeric value.  For example, "%20" is the percent-encoding for the binary octet "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space character (SP).  Section 2.4 describes when percent-encoding and decoding is applied.

      pct-encoded = "%" HEXDIG HEXDIG

   The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f', respectively.  If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent.  For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.

2.2.  Reserved Characters

   URIs include components and subcomponents that are delimited by characters in the "reserved" set.  These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm.  If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

2.3.  Unreserved Characters

   Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.  These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
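Two points from section 2.1, the "%" HEXDIG HEXDIG triplet and the equivalence of upper- and lowercase hex digits, can be illustrated with a small sketch. Both helper names are hypothetical and exist only for this example.

```javascript
// Hypothetical helper: encode one octet (0-255) as a "%XX" triplet.
function percentEncodeOctet(octet) {
  return "%" + octet.toString(16).toUpperCase().padStart(2, "0");
}

// Hypothetical helper: uppercase the hex digits of every existing triplet,
// as the RFC recommends for producers and normalizers.
function normalizePercentCase(uri) {
  return uri.replace(/%[0-9a-fA-F]{2}/g, (triplet) => triplet.toUpperCase());
}

console.log(percentEncodeOctet(0x20));              // %20
console.log(normalizePercentCase("/a%2fb%c3%a9"));  // /a%2Fb%C3%A9

// Triplets that differ only in hex case decode to the same character.
console.log(decodeURIComponent("%c3%a9") === decodeURIComponent("%C3%A9")); // true
```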

Selected character definitions in RFC 2396

JavaScript's encodeURI method does not escape any character that belongs to the reserved or unreserved sets defined in RFC 2396.

2.2. Reserved Characters

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

2.3. Unreserved Characters

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
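A quick check of that claim, as a sketch: the strings below are typed in from the RFC 2396 grammar above, and the alphanumeric sample is just a few representative characters, not the full set.

```javascript
// Character sets copied from the RFC 2396 grammar quoted above.
const reserved2396 = ";/?:@&=+$,";
const mark2396 = "-_.!~*'()";
const alphanumSample = "ABCXYZabcxyz0189"; // representative sample only

const sample2396 = reserved2396 + mark2396 + alphanumSample;

// encodeURI returns the string unchanged for all of these characters.
console.log(encodeURI(sample2396) === sample2396); // true

// Characters outside both sets are percent-encoded.
console.log(encodeURI("a b^c")); // a%20b%5Ec
```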

Selected character definitions in RFC 1738

RFC 1738            Uniform Resource Locators (URL)        December 1994

safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex
unreserved     = alpha | digit | safe | extra
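To see how the unreserved set shifted between RFC 1738 and RFC 3986, the non-alphanumeric members of each can be compared. This is a sketch over strings typed in from the grammars quoted in this article; the variable names are hypothetical.

```javascript
// Non-alphanumeric unreserved characters, copied from the quoted grammars.
const unreserved1738 = "$-_.+!*'(),"; // safe + extra in RFC 1738
const unreserved3986 = "-._~";        // RFC 3986
const subDelims3986 = "!$&'()*+,;=";  // RFC 3986 sub-delims

const onlyIn1738 = [...unreserved1738].filter((c) => !unreserved3986.includes(c)).join("");
const onlyIn3986 = [...unreserved3986].filter((c) => !unreserved1738.includes(c)).join("");

console.log(onlyIn1738); // $+!*'(),
console.log(onlyIn3986); // ~

// Every character dropped from "unreserved" reappears in RFC 3986 as a sub-delim.
console.log([...onlyIn1738].every((c) => subDelims3986.includes(c))); // true
```

The tilde is the one character that RFC 3986 adds to the unreserved set.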

encodeURI

encodeURI leaves the following characters unescaped:

A-Z a-z 0-9 ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #

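Apart from the characters listed above, which have special meaning in a URI, encodeURI escapes everything else. Because it does not escape &, +, and =, which have special meaning in GET/POST request parameters, applying it to parameter values can produce incorrect requests. Use encodeURIComponent to escape those characters.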
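The difference shows up as soon as a parameter value contains one of those characters. In this sketch, https://example.com/search and the parameter name q are placeholders, and the URL class stands in for the receiving parser.

```javascript
// A parameter value containing characters that are significant in a query string.
const value = "a&b=c+d";

// Placeholder endpoint; only the query string matters for this example.
const withEncodeURI = "https://example.com/search?q=" + encodeURI(value);
const withComponent = "https://example.com/search?q=" + encodeURIComponent(value);

console.log(withEncodeURI); // https://example.com/search?q=a&b=c+d
console.log(withComponent); // https://example.com/search?q=a%26b%3Dc%2Bd

// Parsing the results shows what a receiver would see.
console.log(new URL(withEncodeURI).searchParams.get("q")); // a
console.log(new URL(withEncodeURI).searchParams.get("b")); // c d
console.log(new URL(withComponent).searchParams.get("q")); // a&b=c+d
```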

encodeURIComponent

In JavaScript, URIs are normally encoded and decoded with four methods: encodeURI, decodeURI, encodeURIComponent, and decodeURIComponent. The escape and unescape methods should not be used.

Another interesting consideration for Global methods is the escaping of strings provided by escape() and unescape(). Primarily, we see this done on the Web in order to create URL safe strings. You probably have seen this when working with forms. While these methods would be extremely useful, the ECMAScript specification suggests that escape() and unescape() are deprecated in favor of the more aptly named encodeURI(), encodeURIComponent(), decodeURI(), and decodeURIComponent().

encodeURIComponent leaves the following characters unescaped:

A-Z a-z 0-9 - _ . ~ ! * ' ( )

encodeURIComponent escapes every character other than those listed above.

Although ! ' ( ) * are reserved characters in RFC 3986, they are not actually used as delimiters in URIs, so they can also be encoded, as PHP's rawurlencode does.

A JavaScript version of rawurlencode

function rawurlencode(str) {
    // Also encode the characters ! ' ( ) * that encodeURIComponent leaves alone.
    return encodeURIComponent(str).replace(/[!'()*]/g, function(c) {
        return '%' + c.charCodeAt(0).toString(16).toUpperCase();
    });
}

Output

node
> rawurlencode("+|`^!'()*~-_.09azAZ")
"%2B%7C%60%5E%21%27%28%29%2A~-_.09azAZ"
> encodeURI("+|`^!'()*~-_.09azAZ")
"+%7C%60%5E!'()*~-_.09azAZ"

decodeURIComponent

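If a string in a URL contains a % that is not followed by two hexadecimal digits, calling decodeURI or decodeURIComponent on it throws a URIError (the message is "URI malformed" or a similar malformed-URI wording, depending on the engine). To represent a literal percent sign, encode it as %25.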

@exception IllegalArgumentException if a '%' character is not followed by a valid 2-digit hexadecimal number
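On the JavaScript side, the error can be caught so that malformed input does not abort processing. A minimal sketch: safeDecode is a hypothetical helper, and returning null for malformed input is just one possible policy.

```javascript
// Hypothetical helper: decode a component, returning null if it is malformed.
function safeDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) {
      return null; // "%" was not followed by two valid hex digits
    }
    throw err; // anything else is unexpected, so rethrow
  }
}

console.log(safeDecode("100%"));   // null
console.log(safeDecode("100%25")); // 100%
console.log(safeDecode("%zz"));    // null
```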

rawurlencode in PHP

JavaScript's encodeURIComponent and PHP's rawurlencode produce identical results as long as the input contains none of the characters ! * ' ( ). Neither function touches - _ . ~ (before PHP 5.3.0, rawurlencode also escaped ~).

Test code, rawurlencode.php

<?php
    echo rawurlencode("+|`^!'()*~-_.09azAZ");
?>

Output

php -f rawurlencode.php
%2B%7C%60%5E%21%27%28%29%2A~-_.09azAZ

urlencode in PHP

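urlencode returns a string in which every non-alphanumeric character except - _ . is replaced by a percent sign (%) followed by two hexadecimal digits, and spaces are encoded as plus signs (+). This is the same encoding used for data posted from a WWW form, that is, the application/x-www-form-urlencoded media type. For historical reasons, it differs from RFC 3986 encoding in that spaces are encoded as plus signs.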
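For comparison with the JavaScript rawurlencode shown earlier, the rules in this paragraph can be approximated in JavaScript as well. This is a sketch that follows the description above (everything except alphanumerics and - _ . is encoded, spaces become +); phpStyleUrlencode is a hypothetical name, not a built-in.

```javascript
// Hypothetical approximation of PHP's urlencode, built on encodeURIComponent.
function phpStyleUrlencode(str) {
  return encodeURIComponent(str)
    // encodeURIComponent leaves ! ' ( ) * ~ alone; the rule above encodes them.
    .replace(/[!'()*~]/g, (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase())
    // Spaces arrive here as %20 and are rewritten to "+".
    .replace(/%20/g, "+");
}

console.log(phpStyleUrlencode("a b+c~d*e")); // a+b%2Bc%7Ed%2Ae
console.log(phpStyleUrlencode("-_.09azAZ")); // -_.09azAZ
```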

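Problems with encoding and decoding the "+" character

The character most likely to cause trouble in encoding and decoding is +. As described above, front-end encoding can produce the following results:

  1. Under the application/x-www-form-urlencoded rules, a space is encoded as +.
  2. encodeURI treats + as a reserved character and does not encode it, so the result is still +.
  3. encodeURIComponent encodes + as %2B.

The back end may therefore receive either of the following:

  1. It receives %2B.
  2. It receives a literal +.

Decoding %2B to + is unambiguous. A literal + received by the back end is ambiguous, however: PHP's urldecode and Java's java.net.URLDecoder.decode(String s, String enc) decode + to a space, whereas rawurldecode leaves it as +. The decoded result may therefore differ from what was originally encoded, so the front end should encode parameters with encodeURIComponent before passing them to the back end.

The file urldecode.php

<?php
    echo urldecode("+");    // prints a single space ' '
    echo rawurldecode("+"); // prints the plus sign itself '+'
?>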
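The same ambiguity can be shown in JavaScript. In this sketch, formDecode is a hypothetical helper that imitates a form-style decoder (one that treats + as a space, as urldecode does), while decodeURIComponent behaves like rawurldecode for this input.

```javascript
// Hypothetical form-style decoder: "+" means space, then percent-decode.
function formDecode(encoded) {
  return decodeURIComponent(encoded.replace(/\+/g, "%20"));
}

const original = "1+1";

// A literal "+" on the wire decodes differently depending on the decoder.
console.log(formDecode("1+1"));         // 1 1
console.log(decodeURIComponent("1+1")); // 1+1

// After encodeURIComponent, both decoders agree on the original value.
const wire = encodeURIComponent(original);
console.log(wire);                      // 1%2B1
console.log(formDecode(wire));          // 1+1
console.log(decodeURIComponent(wire));  // 1+1
```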

Character problems in Tomcat requests

The following error sometimes appears in the Tomcat request log:

Apr 21, 2018 9:19:15 AM org.apache.coyote.http11.Http11Processor service
INFO: Error parsing HTTP request header
Note: further occurrences of HTTP header parsing errors will be logged at DEBUG level.
java.lang.IllegalArgumentException: Invalid character found in the request target. The valid characters are defined in RFC 7230 and RFC 3986
    at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:476)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:687)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1459)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

The code around line 100 of org.apache.tomcat.util.http.parser.HttpParser, shown below, defines which characters may not appear in a URL:

// Not valid for request target.
// Combination of multiple rules from RFC7230 and RFC 3986. Must be
// ASCII, no controls plus a few additional characters excluded
if (IS_CONTROL[i] || i > 127 ||
        i == ' ' || i == '\"' || i == '#' || i == '<' || i == '>' || i == '\\' ||
        i == '^' || i == '`'  || i == '{' || i == '|' || i == '}') {
    IS_NOT_REQUEST_TARGET[i] = true;
}

Test

# This request is rejected
curl "http://localhost:8080/hello?a=>"
# After encoding, the request is accepted
curl "http://localhost:8080/hello?a=%3E"
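A client can check a request target against the same rule before sending it. The sketch below mirrors the condition in the Java snippet above; invalidRequestTargetChars is a hypothetical helper, and treating code points below 32 and 127 as control characters is an assumption about what IS_CONTROL covers.

```javascript
// Extra characters excluded by the condition in the Java snippet above.
const EXTRA_INVALID = new Set([" ", '"', "#", "<", ">", "\\", "^", "`", "{", "|", "}"]);

// Hypothetical helper: list the characters that the rule above would reject.
function invalidRequestTargetChars(target) {
  const rejected = [];
  for (const ch of target) {
    const code = ch.codePointAt(0);
    // Assumption: "control" means code points below 32, plus 127 (DEL).
    const isControl = code < 32 || code === 127;
    if (isControl || code > 127 || EXTRA_INVALID.has(ch)) {
      rejected.push(ch);
    }
  }
  return rejected;
}

console.log(invalidRequestTargetChars("/hello?a=>"));   // [ '>' ]
console.log(invalidRequestTargetChars("/hello?a=%3E")); // []
console.log(encodeURIComponent(">"));                   // %3E
```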

Related RFCs

  1. RFC 1738
  2. RFC 2396
  3. RFC 2616
  4. RFC 3986

References

  1. php rawurlencode
  2. php urlencode
  3. application/x-www-form-urlencoded serializing
  4. encodeURIComponent
  5. Percent-encoding and encodeURIComponent (百分号编码与 encodeURIComponent)
  6. A brief look at the HTTP URL specification (浅谈 HTTP URL 规范)