How to tell where letters begin and end in hex Thread poster: Samuel Murray
| Samuel Murray Netherlands Local time: 23:18 Member (2006) English to Afrikaans + ...
Hello everyone

The hex for " ÿ " is "20 C3 BF 20". The spaces are "20" and the "ÿ" is "C3 BF". How can I tell that "20" is the first letter, and not "20 C3"? And how can I tell that "C3 BF" is the second letter, and not just "C3"?

Thanks
Samuel

| | | It is all about UTF-8 and BOM (Byte order mark) | Oct 19, 2021 |
As I can understand the string in question is UTF-8 encoded. So the order depends on BOM of the file if any. If there is none, read the string from left to right, because UTF-8 is backward compatible with ASCII. UTF-8 uses one to four one-byte (8-bit) code units per character. The first 128 code points are the same as the 128 ASCII characters. In general, I suggest reading up on UTF-8 encoding, for example on Wikipedia.

Byte order mark
If the UTF-16 Unicode byte order mark (BOM, U+FEFF) character is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit. UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

Hope this helps

| | | Samuel Murray Netherlands Local time: 23:18 Member (2006) English to Afrikaans + ...
TOPIC STARTER
Mikhail Zavidin wrote:
As I can understand the string in question is UTF-8 encoded.

Correct. If it was UTF-16LE, it would be "2000 FF00 2000" instead of "20 C3BF 20".

Mikhail Zavidin wrote:
So the order depends on BOM of the file if any.

UTF-8 doesn't have a byte order (or: it has only one byte order, depending on how you explain it). The "byte order mark" added to UTF-8 files is called that for historical reasons, not because it indicates a byte order. Anyway, the byte order (even if there was one) isn't really relevant to the question.

I'm trying to figure out how I can tell, just by looking at "20 C3 BF 20", that "20 C3" and "BF 20" are not characters, but that "20" is a character, "C3 BF" is a character, and "20" is a character. The problem is that sometimes a character is encoded as two hex digits (one byte) and sometimes as four hex digits (two bytes), and I want to know how I can tell which is which.

| | | Why not review the conversion table | Oct 20, 2021 |
Code point to UTF-8 conversion

First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0xxxxxxx
U+0080             U+07FF            110xxxxx   10xxxxxx
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

https://en.wikipedia.org/wiki/UTF-8#Encoding

As I understand it, if the highest bits of a byte are 110 (110xxxxx), the symbol consists of 2 bytes. If the highest bits are 1110 (1110xxxx), the symbol consists of 3 bytes. And so on, according to the table above.

In your example, the first 20 (0010 0000) has 0 in its highest bit, so it is a single-byte symbol from the first 128 symbols of the ASCII table. The second byte is C3, i.e. 1100 0011, so it starts a two-byte symbol. The BF that follows (1011 1111) begins with 10, so it is a continuation byte that belongs to C3, and the final 20 is again a single-byte symbol.

Hope this helps
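For anyone who wants to try the rule from the table above, here is a minimal Python sketch of it. It reads the leading bits of each byte to decide how many bytes belong to the current character, then checks that the following bytes all start with 10. The function names `utf8_sequence_length` and `split_utf8` are made up for this illustration, and the sketch only applies the bit patterns from the table: it does not reject every invalid sequence a strict UTF-8 decoder would (overlong encodings, for example).

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx: single byte (ASCII range)
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110xxxxx: two-byte character
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110xxxx: three-byte character
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 11110xxx: four-byte character
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 character")


def split_utf8(data: bytes) -> list[bytes]:
    """Split UTF-8 bytes into one chunk per character (hypothetical helper)."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must look like 10xxxxxx.
        if len(chunk) != n or any((b & 0b1100_0000) != 0b1000_0000 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chunks.append(chunk)
        i += n
    return chunks


data = bytes.fromhex("20 C3 BF 20")
chunks = split_utf8(data)
print([c.hex(" ").upper() for c in chunks])   # ['20', 'C3 BF', '20']
print([c.decode("utf-8") for c in chunks])    # [' ', 'ÿ', ' ']
```

So "20 C3" can never be one character, because 20 starts with a 0 bit and therefore stands alone, and "C3" on its own is incomplete, because 110xxxxx promises exactly one more byte.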
[Edited at 2021-10-20 10:14 GMT]
[Edited at 2021-10-20 10:17 GMT]
[Edited at 2021-10-20 10:18 GMT]