php multibyte function overloading in php must be disabled for string functions

20.07.2022 admin

Why use multibyte string functions in PHP?


PHP’s standard string functions do not handle multibyte strings, regardless of your operating system’s locale. That is why you need to use the multibyte string functions.

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
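A minimal sketch of the corruption described above, assuming the source file is saved as UTF-8 and the mbstring extension is loaded. The variable names are illustrative only.

```php
<?php
// Assumption: this file is UTF-8 encoded and ext/mbstring is available.
$text = '日本語'; // 3 characters, 9 bytes in UTF-8

// Byte-oriented cut: stops after 4 bytes, in the middle of the second character.
$byteCut = substr($text, 0, 4);

// Character-oriented cut: returns the first 2 whole characters.
$charCut = mb_substr($text, 0, 2, 'UTF-8');

// The byte cut leaves a dangling partial character, so it is no longer valid UTF-8.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');
```

Here $byteCut ends with one stray lead byte, which is the kind of garbage a browser typically renders as a replacement character.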

People here don’t understand UTF-8.

You do not need to use UTF-8 aware code to process UTF-8. For the most part.

I’ve even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It’s hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.

It’s very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won’t work 😉 because of decomposed characters.
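The point about decomposed characters can be illustrated with a short sketch. It assumes PHP 7 or later (for the "\u{...}" escape syntax) and the mbstring extension; the variable names are made up for the example.

```php
<?php
// Two ways to write the same visible character "é".
$composed   = "\u{00E9}";  // one code point (precomposed form, NFC)
$decomposed = "e\u{0301}"; // "e" followed by a combining acute accent (NFD)

// mb_strlen() counts code points, not what a reader perceives as characters.
$composedCount   = mb_strlen($composed, 'UTF-8');
$decomposedCount = mb_strlen($decomposed, 'UTF-8');

// The byte sequences differ too, so a plain comparison fails.
$sameBytes = ($composed === $decomposed);
```

So even a "character-aware" count reports 2 for the decomposed form, although a reader sees a single letter.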

But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.

It’s because no UTF-8 character can be found inside any other UTF-8 character. That’s how it is designed.

Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.

Here is my answer in plain English. A single Japanese, Chinese, or Korean character takes more than one byte. For example, a typical English character, say x, takes 1 byte, while a Japanese, Chinese, or Korean character takes more than 1 byte. PHP’s standard string functions treat every byte as one character, so if you try to compare or measure Japanese, Chinese, or Korean characters with them, they will not work as expected. For example, "Hello World!" written in Japanese, Chinese, or Korean will occupy more than 12 bytes.

PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.

However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.

So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.

The same applies to preg_replace() : If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.

No sane ‘default’ is possible, because PHP strings do not carry a character encoding. And even if one were possible, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as is useful for stating the length of a blog article.
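The two meanings of "length" can be put side by side in a small sketch. It assumes a UTF-8 source file, the mbstring extension, and that function overloading is not active; the variable names are illustrative.

```php
<?php
// Assumption: UTF-8 source file, ext/mbstring loaded, no function overloading.
$body = 'こんにちは';

// Number of bytes: this is what a Content-Length header needs.
$byteLength = strlen($body);

// Number of characters (code points): this is what a reader would count.
$charLength = mb_strlen($body, 'UTF-8');

$header = 'Content-Length: ' . $byteLength;
```

If strlen() were silently replaced by mb_strlen(), the header above would announce 5 instead of 15 and the response would be truncated.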

That’s why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.


Drupal Russian-speaking community

php_value magic_quotes_gpc 0
php_value register_globals 0
php_value session.auto_start 0
php_value mbstring.http_input pass
php_value mbstring.http_output pass
php_value mbstring.encoding_translation 0

# PHP 5, Apache 1 and 2.

Point me to exactly where it should be written and exactly what the line looks like. Then, as I understand it, I will need to upload this file to the folder, replacing the existing one.
Treat me as a complete beginner.

Function Overloading Feature

This feature has been DEPRECATED as of PHP 7.2.0. Relying on this feature is highly discouraged.

mbstring supports a ‘function overloading’ feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled. This feature makes it easy to port applications that only support single-byte encodings to a multibyte environment in many cases.

To use function overloading, set mbstring.func_overload in php.ini to a positive value that represents a combination of bitmasks specifying the categories of functions to be overloaded: 1 to overload the mail() function, 2 for string functions, and 4 for regular expression functions. For example, if it is set to 7, the mail, string, and regular expression functions will all be overloaded. The list of overloaded functions is shown below.
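The bitmask can be decoded with a few lines of PHP. This is only a sketch: overloadedCategories() is a hypothetical helper, not part of mbstring, and the ini_get() fallback assumes that the directive is simply absent on PHP versions where the feature has been removed.

```php
<?php
// Hypothetical helper: lists the categories enabled by a func_overload value.
function overloadedCategories(int $value): array
{
    $categories = [];
    if ($value & 1) {
        $categories[] = 'mail';   // mail() -> mb_send_mail()
    }
    if ($value & 2) {
        $categories[] = 'string'; // strlen(), substr(), ... -> mb_* counterparts
    }
    if ($value & 4) {
        $categories[] = 'regex';  // ereg(), split(), ... -> mb_ereg*, mb_split
    }
    return $categories;
}

// ini_get() returns false when the directive does not exist; treat that as 0.
$raw = ini_get('mbstring.func_overload');
$current = ($raw === false) ? 0 : (int) $raw;
$stringFunctionsOverloaded = in_array('string', overloadedCategories($current), true);
```

An application that requires overloading to be disabled for string functions could refuse to start when $stringFunctionsOverloaded is true.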

Functions to be overloaded

value of mbstring.func_overload	original function	overloaded function
1	mail()	mb_send_mail()
2	strlen()	mb_strlen()
2	strpos()	mb_strpos()
2	strrpos()	mb_strrpos()
2	substr()	mb_substr()
2	strtolower()	mb_strtolower()
2	strtoupper()	mb_strtoupper()
2	stripos()	mb_stripos()
2	strripos()	mb_strripos()
2	strstr()	mb_strstr()
2	stristr()	mb_stristr()
2	strrchr()	mb_strrchr()
2	substr_count()	mb_substr_count()

It is not recommended to use the function overloading option in the per-directory context, because it’s not confirmed yet to be stable enough in a production environment and may lead to undefined behaviour.


Multibyte String Functions

Multibyte character encoding schemes and their related issues are fairly complicated, and are beyond the scope of this documentation. Please refer to the following URLs and other resources for further information regarding these topics.

Japanese/Korean/Chinese character information

User Contributed Notes

Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

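$string = '漢字はユニコード';
$needle = 'は';
$replace = 'Foo';
// Both strings are UTF-8 here, so the byte-wise replacement is safe.
echo str_replace($needle, $replace, $string); // 漢字Fooユニコード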

The usual problem is that the string is evaluated as binary string, meaning PHP is not aware of encodings at all. Problems arise if you are getting a value «from outside» somewhere (database, POST request) and the encoding of the needle and the haystack is not the same. That typically means the source code is not saved in the same encoding as you are receiving «from outside». Therefore the binary representations don’t match and nothing happens.

PHP can input and output Unicode, but in a slightly different way from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP’s "UTF-16" means big-endian with a BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.
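The fix described in this note could look roughly like the following. toExcelUtf16() is a made-up name for illustration, and the sketch assumes the input is valid UTF-8 and that the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: converts UTF-8 text to UTF-16LE and prepends the BOM.
function toExcelUtf16(string $utf8Text): string
{
    $bom = "\xFF\xFE"; // little-endian byte order mark, i.e. chr(255) . chr(254)
    return $bom . mb_convert_encoding($utf8Text, 'UTF-16LE', 'UTF-8');
}

$csv = toExcelUtf16("name,city\n");
```

The result is a binary string, so it should be written or sent as-is, without any further encoding conversion on output.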

SOME multibyte encodings can safely be used in str_replace() and the like, others cannot. It’s not enough to ensure that all the strings involved use the same encoding: obviously they have to, but it’s not enough. It has to be the right sort of encoding.

UTF-8 is one of the safe ones, because it was designed to be unambiguous about where each encoded character begins and ends in the string of bytes that makes up the encoded text. Some encodings are not safe: the last bytes of one character in a text followed by the first bytes of the next character may together make a valid character. str_replace() knows nothing about «characters», «character encodings» or «encoded text». It only knows about the string of bytes. To str_replace(), two adjacent characters with two-byte encodings just looks like a sequence of four bytes and it’s not going to know it shouldn’t try to match the middle two bytes.
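A real-world case of an unsafe encoding is Shift_JIS, where the second byte of some two-byte characters equals an ASCII byte. The sketch below uses the katakana ソ, whose Shift_JIS encoding is 83 5C, and searches for a backslash (5C). It assumes the mbstring extension is loaded; the mb_strpos() call is expected to report no match because it works on characters rather than bytes.

```php
<?php
$sjisSo = "\x83\x5C"; // the single character ソ encoded in Shift_JIS
$utf8So = "\u{30BD}"; // the same character encoded in UTF-8 (E3 82 BD)

// Byte-wise search: finds a "backslash" inside the Shift_JIS character.
$bytePosSjis = strpos($sjisSo, '\\');

// Encoding-aware search: treats the two bytes as one character.
$charPosSjis = mb_strpos($sjisSo, '\\', 0, 'SJIS');

// In UTF-8 no byte of a multibyte character falls in the ASCII range,
// so even the byte-wise search cannot produce a false match.
$bytePosUtf8 = strpos($utf8So, '\\');
```

This is the same failure mode as the one described above: the byte-wise function matches part of a character.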

While real-world examples can be found of str_replace() mangling text, it can be illustrated by using the HTML-ENTITIES encoding. It’s not one of the safe ones. All of the strings being passed to str_replace() are valid HTML-ENTITIES-encoded text so the «all inputs use the same encoding» rule is satisfied.

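The text is "x<y". In HTML-ENTITIES it is represented by the string 'x&lt;y': three characters, but six bytes.

$string = 'x&lt;y';
mb_internal_encoding('HTML-ENTITIES');
$newstring = str_replace('l', 'g', $string); // 'x&gt;y': the text now reads "x>y"
$newstring = str_replace(';', ':', $string); // 'x&lt:y': no longer a valid entity

Even though neither ‘l’ nor ‘;’ appears in the text "x<y", str_replace() still found and changed bytes. In one case it changed the text to "x>y", and in the other it broke the encoding completely.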

One more reason to use UTF-8 if you can, I guess.

Yet another single-line mb_trim() function

PHP5 has no mb_trim(), so here’s one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).
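A minimal sketch of such a function, not the note author's original code. The name mb_trim_sketch() is hypothetical (chosen to avoid clashing with any built-in), and the default character class is an assumption; the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: trims characters matching a PCRE character class
// from both ends of a UTF-8 string. $charClass goes inside [...] unescaped,
// so it must be trusted input.
function mb_trim_sketch(string $text, string $charClass = '\s\pZ'): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/^[' . $charClass . ']+|[' . $charClass . ']+$/u';
    $result = preg_replace($pattern, '', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8); keep the input then.
    return $result ?? $text;
}

// U+3000 (ideographic space) and U+00A0 (no-break space) are both in \pZ.
$trimmed = mb_trim_sketch("\u{3000} 日本語 \u{00A0}");
```

Plain trim() would leave the ideographic and no-break spaces in place, because it only strips a fixed set of single-byte characters.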


LII. Multi-Byte String Functions

There are many languages in which every character can be expressed in a single byte. Multi-byte character codes are used to express the much larger character sets of many other languages. mbstring was developed to handle Japanese characters; however, many mbstring functions are able to handle character encodings other than Japanese.

A multi-byte character encoding represents a single character with consecutive bytes. Some character encodings have shift (escape) sequences to start and end multi-byte character strings. Therefore, a multi-byte character string may be destroyed when it is divided and/or counted unless a multi-byte-safe method is used. This module provides multi-byte-safe string functions and other utility functions such as conversion functions.

Since PHP is basically designed for ISO-8859-1, some multi-byte character encodings do not work well with PHP. Therefore, it is important to set mbstring.internal_encoding to a character encoding that works with PHP.

PHP4 Character Encoding Requirements

Single byte characters in range of 00h-7fh which is compatible with ASCII

Multi-byte characters without 00h-7fh

These are examples of internal character encodings that work with PHP and that do NOT work with PHP.

Character encodings that work with PHP: ISO-8859-*, EUC-JP, UTF-8
Character encodings that do NOT work with PHP: JIS, SJIS

A character encoding that does not work with PHP may be converted with mbstring’s HTTP input/output conversion feature/function.

Note: SJIS should not be used for internal encoding unless the reader is familiar with parsers/compilers, character encodings and character encoding issues.

Note: If you use databases with PHP, it is recommended that you use the same character encoding for both the database and the internal encoding, for ease of use and better performance.

If you are using PostgreSQL, it supports character encoding that is different from backend character encoding. See the PostgreSQL manual for details.

mbstring is an extended module. You must enable the module with the configure script. Refer to the Install section for details.

The following configure options are related to the mbstring module.

--enable-mbstring : Enable mbstring functions. This option is required to use mbstring functions.

--enable-mbstr-enc-trans : Enable HTTP input character encoding conversion using mbstring conversion engine. If this feature is enabled, HTTP input character encoding may be converted to mbstring.internal_encoding automatically.

--enable-mbregex : Enable regular expression functions with multibyte character support.

Table 1. Multi-Byte String configuration options

Name	Default	Changeable
mbstring.language	NULL	PHP_INI_ALL
mbstring.detect_order	NULL	PHP_INI_ALL
mbstring.http_input	NULL	PHP_INI_ALL
mbstring.http_output	NULL	PHP_INI_ALL
mbstring.internal_encoding	NULL	PHP_INI_ALL
mbstring.script_encoding	NULL	PHP_INI_ALL
mbstring.substitute_character	NULL	PHP_INI_ALL
mbstring.func_overload	"0"	PHP_INI_SYSTEM
mbstring.encoding_translation	"0"	PHP_INI_ALL

Here is a short explanation of the configuration directives.

mbstring.language defines the default language used in mbstring. Note that this option also defines mbstring.internal_encoding, so mbstring.internal_encoding should be placed after mbstring.language in php.ini.

mbstring.encoding_translation enables HTTP input character encoding detection and translation into the internal character encoding.

mbstring.internal_encoding defines default internal character encoding.

mbstring.http_input defines default HTTP input character encoding.

mbstring.http_output defines default HTTP output character encoding.

mbstring.substitute_character defines character to substitute for invalid character encoding.

Web browsers are supposed to use the same character encoding as the page when submitting a form. However, browsers may not do so. See mb_http_input() to detect the character encoding used by browsers.

If enctype is set to multipart/form-data in HTML forms, mbstring does not convert character encoding in POST data. The user must convert them in the script, if conversion is needed.
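Converting such a value in the script could be sketched as follows. convertFormValue() and the form field name are hypothetical, and the source encoding has to be known or detected by the application; the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: converts one submitted value to the target encoding,
// rejecting input that is not valid in the declared source encoding.
function convertFormValue(string $raw, string $fromEncoding, string $toEncoding = 'UTF-8'): string
{
    if (!mb_check_encoding($raw, $fromEncoding)) {
        throw new InvalidArgumentException("Value is not valid $fromEncoding");
    }
    return mb_convert_encoding($raw, $toEncoding, $fromEncoding);
}

// Usage with a made-up field name; falls back to an empty string if absent.
$name = convertFormValue($_POST['name'] ?? '', 'SJIS');
```

Validating before converting avoids silently replacing broken byte sequences with the substitute character.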

Although browsers are smart enough to detect the character encoding in HTML, it is better to set the charset in the HTTP header. Change default_charset according to the character encoding.

Example 1. php.ini setting example

;; Set default language
mbstring.language = English  ; Set default language to English (default)
mbstring.language = Japanese ; Set default language to Japanese
;; Set default internal encoding
;; Note: Make sure to use a character encoding that works with PHP
mbstring.internal_encoding = UTF-8 ; Set internal encoding to UTF-8
;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On
;; Set default HTTP input character encoding
;; Note: Script cannot change http_input setting.
mbstring.http_input = pass ; No conversion.
mbstring.http_input = auto ; Set HTTP input to auto
                           ; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS"
mbstring.http_input = SJIS ; Set HTTP input to SJIS
mbstring.http_input = UTF-8,SJIS,EUC-JP ; Specify order
;; Set default HTTP output character encoding
mbstring.http_output = pass  ; No conversion
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
;; Set default character encoding detection order
mbstring.detect_order = auto ; Set detect order to auto
mbstring.detect_order = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order
;; Set default substitute character
mbstring.substitute_character = 12307 ; Specify Unicode value
mbstring.substitute_character = none  ; Do not print character
mbstring.substitute_character = long  ; Long Example: U+3000,JIS+7E7E

Example 2. php.ini setting for EUC-JP users

;; Disable Output Buffering
output_buffering = Off
;; Set HTTP header charset
default_charset = EUC-JP
;; Set default language to Japanese
mbstring.language = Japanese
;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On
;; Set HTTP input encoding conversion to auto
mbstring.http_input = auto
;; Convert HTTP output to EUC-JP
mbstring.http_output = EUC-JP
;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP
;; Do not print invalid characters
mbstring.substitute_character = none

Example 3. php.ini setting for SJIS users

;; Enable Output Buffering
output_buffering = On
;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
;; Set HTTP header charset
default_charset = Shift_JIS
;; Set default language to Japanese
mbstring.language = Japanese
;; Set http input encoding conversion to auto
mbstring.http_input = auto
;; Convert to SJIS
mbstring.http_output = SJIS
;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP
;; Do not print invalid characters
mbstring.substitute_character = none

This extension does not define any resource types.

The constants listed below are defined by this extension and are available only if PHP was built with support for this extension or if the extension is loaded dynamically at runtime.

HTTP input/output character encoding conversion may convert binary data also. Users are supposed to control character encoding conversion if binary data is used for HTTP input/output.

Example 4. Disable HTTP input conversion in php.ini

;; Disable HTTP Input conversion
mbstring.http_input = pass
;; Disable HTTP Input conversion (PHP 4.3.0 or higher)
mbstring.encoding_translation = Off

Note: For PHP3-i18n users, mbstring’s output conversion differs from PHP3-i18n. Character encoding is converted using an output buffer.

Example 5. php.ini setting example

;; Enable output character encoding conversion for all PHP pages
;; Enable Output Buffering
output_buffering = On
;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler

Example 6. Script example

Currently, the following character encodings are supported by the mbstring module. A character encoding may be specified for the encoding parameter of mbstring functions.

The following character encodings are supported in this PHP extension:

Any php.ini entry that accepts an encoding name also accepts "auto" and "pass". Any mbstring function that accepts an encoding name also accepts "auto".

If "pass" is set, no character encoding conversion is performed.

If "auto" is set, it is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS".

Note: "Supported character encoding" does not mean that it works as the internal character code.

Because almost all PHP applications are written for languages that use a single-byte character encoding, there are some difficulties in handling multibyte strings, including Japanese. Almost all PHP string functions, such as substr(), do not support multibyte strings.

The multibyte extension (mbstring) provides multibyte-aware counterparts of some PHP string functions (e.g. mb_substr() for substr()).

The multibyte extension (mbstring) also supports ‘function overloading’ to add multibyte string functionality without code modification. With function overloading, some PHP string functions are overloaded by their multibyte counterparts. For example, mb_substr() is called instead of substr() if function overloading is enabled. Function overloading makes it easy to port an application that supports only single-byte encodings to a multibyte application.

mbstring.func_overload in php.ini should be set to a positive value to use function overloading. The value specifies the categories of functions to overload: 1 to enable mail function overloading, 2 for string functions, and 4 for regular expression functions. For example, if it is set to 7, the mail, string, and regex functions are all overloaded. The list of overloaded functions is shown below.

Table 2. Functions to be overloaded

value of mbstring.func_overload	original function	overloaded function
1	mail()	mb_send_mail()
2	strlen()	mb_strlen()
2	strpos()	mb_strpos()
2	strrpos()	mb_strrpos()
2	substr()	mb_substr()
2	strtolower()	mb_strtolower()
2	strtoupper()	mb_strtoupper()
2	substr_count()	mb_substr_count()
4	ereg()	mb_ereg()
4	eregi()	mb_eregi()
4	ereg_replace()	mb_ereg_replace()
4	eregi_replace()	mb_eregi_replace()
4	split()	mb_split()

Most Japanese characters need more than 1 byte per character. In addition, several character encoding schemas are used under a Japanese environment. There are EUC-JP, Shift_JIS(SJIS) and ISO-2022-JP(JIS) character encoding. As Unicode becomes popular, UTF-8 is used also. To develop Web applications for a Japanese environment, it is important to use the character set for the task in hand, whether HTTP input/output, RDBMS and E-mail.

Storage for a character can be up to six bytes

Some character encoding defines shift(escape) sequence for entering/exiting multi-byte character strings.

ISO-2022-JP must be used for SMTP/NNTP.

«i-mode» web site is supposed to use SJIS.

Multi-byte character encoding and its related issues are very complex. It is impossible to cover in sufficient detail here. Please refer to the following URLs and other resources for further readings.

Japanese/Korean/Chinese character information
