Unicode Transformation Format, 8-bit (UTF-8)

UTF-8
Standard	Unicode Standard
Classification	Unicode Transformation Format, extended ASCII, variable-width encoding
Extends	US-ASCII
Transforms / Encodes	ISO 10646 (Unicode)
Preceded by	UTF-1

UTF-8 stands for 8-bit Unicode Transformation Format. The encoding was designed by Rob Pike and Ken Thompson to represent the Unicode standard, which covers the scripts of most of the world's languages. Each character is encoded in one to four bytes.

The length of a character's encoding is determined from its first byte as follows:

If the value of the first byte is less than 128, i.e. its eighth (most significant) bit is zero, then that byte is the complete encoding of the character and the length is one byte. The ASCII values fall in this range.
If the value of the first byte is greater than 127, i.e. its eighth bit is one, then the character has a multi-byte encoding, as follows:
- The first byte may not have its eighth bit equal to one and its seventh bit equal to zero. Such a value in the first byte of an encoding means there is an error either in the encoding or in the way it is being read. These values are allowed in the second, third and fourth bytes, but not in the first.
- If the eighth and seventh bits of the first byte are one and the sixth bit is zero, the encoding is 2 bytes long.
- If the eighth, seventh and sixth bits of the first byte are one and the fifth bit is zero, the encoding is 3 bytes long.
- If the eighth, seventh, sixth and fifth bits of the first byte are one and the fourth bit is zero, the encoding is 4 bytes long.
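The rules above can be sketched as a small Python function. The name `utf8_sequence_length` is hypothetical and chosen for illustration. The function only inspects the leading bit pattern of the first byte, so it does not reject lead bytes that are invalid for other reasons (such as C0, C1, or F5 to F7).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length in bytes of the sequence that first_byte starts.

    Illustrative helper (not part of any library). Raises ValueError for a
    continuation byte (10xxxxxx) or any other pattern that cannot start a
    sequence of one to four bytes.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if (first_byte & 0b1000_0000) == 0:            # 0xxxxxxx: single byte
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx: 2 bytes
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx: 3 bytes
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx: 4 bytes
        return 4
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")
```

For example, `utf8_sequence_length(0xE2)` returns 3, while passing `0x80` raises `ValueError` because 10xxxxxx is only allowed in the second and later bytes.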

Examples

Examples of UTF-8 encoding
Character	Code point	Binary code point	Binary UTF-8	Hex UTF-8
$	U+0024	010 0100	00100100	24
¢	U+00A2	000 1010 0010	11000010 10100010	C2 A2
ह	U+0939	0000 1001 0011 1001	11100000 10100100 10111001	E0 A4 B9
€	U+20AC	0010 0000 1010 1100	11100010 10000010 10101100	E2 82 AC
한	U+D55C	1101 0101 0101 1100	11101101 10010101 10011100	ED 95 9C
𐍈	U+10348	0 0001 0000 0011 0100 1000	11110000 10010000 10001101 10001000	F0 90 8D 88
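As a rough illustration of how the rows of the table are produced, the following Python sketch encodes a code point by hand and compares the result with Python's built-in `str.encode`. The function name `encode_code_point` is hypothetical. The range checks reflect the U+10FFFF limit and the surrogate range U+D800 to U+DFFF.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} is not an encodable code point")
    if cp < 0x80:       # 7 bits  -> 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Reproduce the rows of the table and check them against the built-in codec.
for ch in "$¢ह€한𐍈":
    encoded = encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    hexa = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X}\t{bits}\t{hexa}")
```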

Octal

Because UTF-8 uses six bits per continuation byte to carry the bits of the character being encoded, octal notation (which uses 3-bit groups) can aid in comparing UTF-8 sequences with one another and in manual conversion.^[1]

Octal code point <-> Octal UTF-8 conversion
First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
0	177	xxx
200	3777	3xx	2xx
4000	77777	34x	2xx	2xx
100000	177777	35x	2xx	2xx
200000	4177777	36x	2xx	2xx	2xx

With octal notation, the arbitrary octal digits, marked with x in the table, will remain unchanged when converting to or from UTF-8.

Example: € = U+20AC = 02 02 54 in octal, which is encoded as 342 202 254 in UTF-8 (E2 82 AC in hex).
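The correspondence can be checked with a short Python sketch. The helper name `octal_view` is hypothetical. It returns the code point in octal and the UTF-8 bytes as three-digit octal numbers, so the digits marked x in the table can be compared directly.

```python
def octal_view(ch):
    """Return (code point in octal, UTF-8 bytes in octal) for one character.

    Illustrative helper only; both values are returned as strings.
    """
    code_point = f"{ord(ch):o}"
    utf8 = " ".join(f"{b:03o}" for b in ch.encode("utf-8"))
    return code_point, utf8


print(octal_view("€"))  # ('20254', '342 202 254')
print(octal_view("¢"))  # ('242', '302 242')
```

In the Euro sign example, the trailing digits of each byte (2, 02, 54) read in order give the octal code point 20254.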

Codepage layout

The following table summarizes usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half (0_ to 7_) is for bytes used only in single-byte codes, so it looks like a normal code page; the lower half is for continuation bytes (8_ to B_) and leading bytes (C_ to F_), and is explained further in the legend below.

UTF-8
	_0	_1	_2	_3	_4	_5	_6	_7	_8	_9	_A	_B	_C	_D	_E	_F
(1 byte) 0_	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
(1) 1_	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US
(1) 2_	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
(1) 3_	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
(1) 4_	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
(1) 5_	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
(1) 6_	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
(1) 7_	p	q	r	s	t	u	v	w	x	y	z	{	\|	}	~	DEL
8_	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•
9_	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•
A_	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•
B_	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•	•
(2) C_	2	2	Latin	Latin	Latin	Latin	Latin	Latin	Latin	IPA	IPA	IPA	accents	accents	Greek	Greek
(2) D_	Cyril	Cyril	Cyril	Cyril	Cyril	Armeni	Hebrew	Hebrew	Arabic	Arabic	Arabic	Arabic	Syriac	Arabic	Thaana	N'Ko
(3) E_	Indic	Misc.	Symbol	Kana…	CJK	CJK	CJK	CJK	CJK	CJK	Asian	Hangul	Hangul	Hangul	PUA	Forms
(4) F_	SMP…			SSP…	SPU…	4	4	4	5	5	5	5	6	6

Blue cells are 7-bit (single-byte) sequences. They must not be followed by a continuation byte.^[2]

Orange cells with a large dot are a continuation byte.^[3] The hexadecimal number shown after the + symbol is the value of the 6 bits they add. This character never occurs as the first byte of a multi-byte sequence.

White cells are the leading bytes for a sequence of multiple bytes,^[4] the length shown at the left edge of the row. The text shows the Unicode blocks encoded by sequences starting with this byte, and the hexadecimal code point shown in the cell is the lowest character value encoded using that leading byte.

Red cells must never appear in a valid UTF-8 sequence. The first two red cells (C0 and C1) could be used only for a 2-byte encoding of a 7-bit ASCII character, which should be encoded in 1 byte; as described below, such "overlong" sequences are disallowed.^[5] To see why, consider code point 128 (hex 80, binary 1000 0000). To encode it in 2 bytes, the low six bits are stored in the second byte as 10 000000 (which is 128 itself), and the upper two bits are stored in the first byte as 110 00010, making the minimum first byte C2. The red cells in the F_ row (F5 to FD) indicate leading bytes of 4-byte or longer sequences that cannot be valid because they would encode code points larger than the U+10FFFF limit of Unicode (a limit derived from the maximum code point encodable in UTF-16 ^[6]). FE and FF do not match any allowed character pattern and are therefore not valid start bytes.^[7]

Pink cells are the leading bytes for a sequence of multiple bytes, of which some, but not all, possible continuation sequences are valid. E0 and F0 could start overlong encodings; in this case the lowest non-overlong-encoded code point is shown. F4 can start code points greater than U+10FFFF, which are invalid. ED can start the encoding of a code point in the range U+D800–U+DFFF; these are invalid since they are reserved for UTF-16 surrogate halves.^[8]
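The layout described in the legend can be summarized as a byte classifier. The following Python sketch is illustrative only: the function name `classify_byte` and the label strings are hypothetical, and the classification looks at a single byte in isolation, so it does not check the extra restrictions on what may follow E0, ED, F0 and F4.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte value by its role in UTF-8 (illustrative)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "single-byte (ASCII)"      # rows 0_ to 7_
    if b <= 0xBF:
        return "continuation byte"        # rows 8_ to B_
    if b in (0xC0, 0xC1):
        return "invalid (could only start an overlong 2-byte sequence)"
    if b <= 0xDF:
        return "lead byte of a 2-byte sequence"   # C2 to DF
    if b <= 0xEF:
        return "lead byte of a 3-byte sequence"   # E0 to EF
    if b <= 0xF4:
        return "lead byte of a 4-byte sequence"   # F0 to F4
    return "invalid (F5 to FF never appear in UTF-8)"
```

For instance, `classify_byte(0xE2)` reports a 3-byte lead, matching the first byte of the Euro sign in the examples table.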

Overlong encodings

In principle, it would be possible to inflate the number of bytes in an encoding by padding the code point with leading 0s. To encode the Euro sign € from the above example in four bytes instead of three, it could be padded with leading 0s until it was 21 bits long – 000 000010 000010 101100, and encoded as 11110000 10000010 10000010 10101100 (or F0 82 82 AC in hexadecimal). This is called an overlong encoding.
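The overlong example can be reproduced in Python. In the sketch below, `naive_decode_4` is a hypothetical helper that strips the prefix bits of a 4-byte sequence without any validity checks, which shows that F0 82 82 AC carries the same payload as the Euro sign. Python's built-in strict UTF-8 decoder, by contrast, refuses the sequence.

```python
def naive_decode_4(seq: bytes) -> int:
    """Extract the 21 payload bits of a 4-byte sequence, with no validation.

    Illustrative only: a real decoder must reject overlong forms.
    """
    if len(seq) != 4:
        raise ValueError("expected exactly 4 bytes")
    return (((seq[0] & 0x07) << 18) | ((seq[1] & 0x3F) << 12)
            | ((seq[2] & 0x3F) << 6) | (seq[3] & 0x3F))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


overlong = bytes([0xF0, 0x82, 0x82, 0xAC])
print(hex(naive_decode_4(overlong)))            # 0x20ac, the Euro sign
print(is_valid_utf8(overlong))                  # False: overlong form
print(is_valid_utf8(bytes([0xE2, 0x82, 0xAC]))) # True: shortest form
```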

References

[1] "BinaryString (flink 1.9-SNAPSHOT API)". ci.apache.org. Retrieved 2021-03-24.
[2] "Chapter 3", The Unicode Standard, p. 54
[3] "Chapter 3", The Unicode Standard, p. 55
[4] "Chapter 3", The Unicode Standard, p. 55
[5] "Chapter 3", The Unicode Standard, p. 54
[6] Template:Cite IETF
[7] "Chapter 3", The Unicode Standard, p. 55
[8] Template:Cite IETF

External links

Original UTF-8 paper (or pdf) for Plan 9 from Bell Labs
UTF-8 test pages:
Unix/Linux: UTF-8/Unicode FAQ, Linux Unicode HOWTO, UTF-8 and Gentoo
Characters, Symbols and the Unicode Miracle at YouTube
