A short introduction to Unicode

Unicode

Unicode is a character encoding standard.

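In Unicode, each character is assigned a code point, a number written as U+ followed by hexadecimal digits. The Unicode code space holds 1,114,112 code points (U+0000 to U+10FFFF), enough to cover the characters of every writing system; only part of that space is assigned so far.
A code point is an abstract number, not a sequence of bytes. To save code points to disk, we have to encode them.
For example, the code point U+004E is the letter N. To save it, it has to be encoded with UTF-8 or another Unicode encoding such as UTF-16 or UTF-32. Legacy encodings such as Shift-JIS can also be used, but they cover only a subset of Unicode.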
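- In UTF-8, the first 128 code points are encoded exactly like ASCII, in 1 byte each; all other code points take 2 to 4 bytes. So a code point takes 1 to 4 bytes on disk.
- In UTF-16, each code point takes 2 or 4 bytes. It suits text whose characters mostly fall in the Basic Multilingual Plane (U+0000 to U+FFFF), which take 2 bytes each; code points above U+FFFF take 4 bytes.
- In UTF-32, every code point takes a fixed 4 bytes (32 bits). That makes indexing by code point simple, but it uses the most space.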
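
The byte counts above can be checked with a small Python sketch. The helper name `encoded_lengths` and the sample characters are only illustrative choices, not part of any standard API; the "-le" codec variants are used so that Python does not prepend a byte order mark to the output.

```python
# Sample characters chosen to need 1, 2, 3 and 4 bytes in UTF-8.
samples = [
    "N",           # U+004E, ASCII range
    "\u0107",      # U+0107, the letter ć
    "\u20ac",      # U+20AC, the euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]


def encoded_lengths(ch):
    """Return the byte count of one character in UTF-8, UTF-16 and UTF-32."""
    # Hypothetical helper for this example; "-le" avoids a byte order mark.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)


for ch in samples:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}  utf-8={utf8}  utf-16={utf16}  utf-32={utf32}")
```

UTF-8 grows from 1 to 4 bytes across the samples, UTF-16 stays at 2 bytes until the emoji needs 4, and UTF-32 is always 4.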

Code point for the character ć:

Letter	Code point
`ć`	`U+0107`
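
As a minimal sketch, the mapping between a character and its code point can be looked up in Python with the built-in `ord` and `chr` functions and the standard `unicodedata` module. The variable names here are only illustrative.

```python
import unicodedata

letter = "\u0107"  # the letter ć, written as an escape to keep the source ASCII

code_point = ord(letter)        # integer value of the code point
label = f"U+{code_point:04X}"   # conventional U+XXXX notation

print(label)                      # U+0107
print(unicodedata.name(letter))   # LATIN SMALL LETTER C WITH ACUTE
print(chr(0x0107) == letter)      # True: chr() goes from number back to character
```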

Byte Encodings

Letter	UTF-8	UTF-16 (little-endian)	Shift-JIS (Shift_JIS-2004)
`ć`	`\xc4\x87`	`\x07\x01`	`\x85\xc9`
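
The table can be reproduced, at least for the Unicode encodings, with Python's `str.encode`. This is a sketch under a few assumptions: `try_encode` is a hypothetical helper written for this example, and the codec names are the ones Python ships (`shift_jis` for the classic encoding, `shift_jisx0213` for the extended Shift_JIS-2004 variant). The classic Shift-JIS character set has no ć, so that codec is expected to refuse it.

```python
letter = "\u0107"  # the letter ć

utf8_bytes = letter.encode("utf-8")          # b'\xc4\x87'
utf16_le_bytes = letter.encode("utf-16-le")  # b'\x07\x01' (low byte first)
utf16_be_bytes = letter.encode("utf-16-be")  # b'\x01\x07' (high byte first)

print(utf8_bytes, utf16_le_bytes, utf16_be_bytes)

# Decoding with the same encoding gives the original character back.
print(utf8_bytes.decode("utf-8") == letter)  # True


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Classic Shift-JIS cannot represent ć, so this prints None.
print(try_encode(letter, "shift_jis"))
# The extended codec covers more Latin letters; its output is printed, not assumed.
print(try_encode(letter, "shift_jisx0213"))
```

The two UTF-16 results differ only in byte order, which is why a UTF-16 byte sequence is ambiguous unless the byte order is known or marked.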

Please give feedback at email@murarisumit.in