How to tell where letters begin and end in hex
Thread poster: Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 23:18
Member (2006)
English to Afrikaans
Oct 19, 2021

Hello everyone

The hex for " ÿ " is "20 C3 BF 20". The spaces are "20" and the "ÿ" is "C3 BF". How can I tell that "20" is the first letter, and not "20 C3"? And how can I tell that "C3 BF" is the second letter, and not just "C3"?

Thanks
Samuel


 
Mikhail Zavidin
Local time: 00:18
English to Russian
It is all about UTF-8 and BOM (Byte order mark) Oct 19, 2021

As far as I understand, the string in question is UTF-8 encoded. So the order depends on the BOM of the file, if any. If there is none, read the string from left to right, because UTF-8 is backward compatible with ASCII. UTF-8 uses one to four one-byte (8-bit) code units per character. The first 128 code points are encoded exactly like the 128 ASCII characters.
In general, I suggest you read up on UTF-8 encoding, for example on Wikipedia.

Byte order mark
If the UTF-16 Unicode byte order mark (BOM, U+FEFF) character is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.


https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
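
To make the BOM remark concrete, here is a minimal Python sketch. It assumes Python 3.8 or later (for the separator argument of `bytes.hex`) and uses the string " ÿ " from the original question; the variable names are only illustrative.

```python
import codecs

text = " ÿ "

# Plain UTF-8: no BOM is written.
plain = text.encode("utf-8")

# Python's "utf-8-sig" codec prepends the three BOM bytes when encoding.
with_bom = text.encode("utf-8-sig")

print(plain.hex(" "))     # 20 c3 bf 20
print(with_bom.hex(" "))  # ef bb bf 20 c3 bf 20

# The BOM only marks the start of the file; the bytes after it are identical.
print(with_bom.startswith(codecs.BOM_UTF8))  # True
print(with_bom[3:] == plain)                 # True
```

With or without the BOM, the bytes of the text itself are the same, so the BOM does not change where one character ends and the next begins.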

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.


https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

Hope this helps


 
Samuel Murray  Identity Verified
Netherlands
Local time: 23:18
Member (2006)
English to Afrikaans
TOPIC STARTER
@Mikhail Oct 19, 2021

Mikhail Zavidin wrote:
As far as I understand, the string in question is UTF-8 encoded.


Correct. If it was UTF-16LE, it would be "2000 FF00 2000" instead of "20 C3BF 20".
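
For anyone who wants to check the two hex dumps, a small Python sketch (assuming Python 3.8 or later for `bytes.hex` with a separator) encodes the same string both ways; the variable names are illustrative only.

```python
text = " ÿ "

# Encode the same three characters in both encodings and show the bytes as hex.
utf8_hex = text.encode("utf-8").hex(" ")
utf16le_hex = text.encode("utf-16-le").hex(" ")

print(utf8_hex)     # 20 c3 bf 20
print(utf16le_hex)  # 20 00 ff 00 20 00
```

In UTF-16LE each of these three characters takes exactly two bytes, while in UTF-8 the spaces take one byte each and the "ÿ" takes two.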

So the order depends on the BOM of the file, if any.


UTF-8 doesn't have a byte order (or, depending on how you explain it, has only one). The "byte order mark" added to some UTF-8 files keeps that name for historical reasons, not because it indicates a byte order. Anyway, byte order, even if there were one, isn't really relevant to the question.

I'm trying to figure out how I can tell, just by looking at "20 C3 BF 20", that "20 C3" and "BF 20" are not characters, but that "20" is a character, "C3 BF" is a character, and "20" is a character. The problem is that a character is sometimes encoded as two hex digits and sometimes as four, and I want to know how to tell which case applies.


 
Mikhail Zavidin
Local time: 00:18
English to Russian
Why not review the conversion table Oct 20, 2021

Code point UTF-8 conversion
First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

https://en.wikipedia.org/wiki/UTF-8#Encoding
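
As a worked example of the two-byte row of the table, the following Python sketch takes the bytes C3 BF from the question, strips the marker bits, and joins the remaining "x" bits into a code point. The variable names are just for illustration.

```python
lead, continuation = 0xC3, 0xBF  # 1100 0011 and 1011 1111

# The lead byte must match 110xxxxx and the second byte must match 10xxxxxx.
assert lead >> 5 == 0b110
assert continuation >> 6 == 0b10

# Keep the 5 payload bits of the lead byte and the 6 payload bits of the
# continuation byte, then join them into one 11-bit number.
code_point = ((lead & 0b0001_1111) << 6) | (continuation & 0b0011_1111)

print(f"U+{code_point:04X}", chr(code_point))  # U+00FF ÿ
```

The result, U+00FF, falls in the U+0080 to U+07FF range, which is why this character needs two bytes.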

As far as I understand, if the high bits of a byte are 110 (110xxxxx), the character consists of 2 bytes.
If the high bits are 1110 (1110xxxx), the character consists of 3 bytes.
And so on, according to the table above.

In your example, the first byte, 20 (0010 0000), has 0 as its highest bit, so it is a single-byte character from the first 128 characters, which match the ASCII table. The second byte is C3 (1100 0011); its high bits are 110, so it starts a two-byte character. The third byte, BF (1011 1111), starts with 10, which marks it as a continuation byte that completes the character begun by C3. The final 20 is again a single-byte character.
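
The same rule can be sketched in Python. This is only an illustration, not a full validator: the function names `sequence_length` and `split_utf8` are made up for this example, and it checks nothing beyond the leading bits of each byte.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes a character occupies, judged by its first byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110xxxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError(f"byte {lead:02X} cannot start a character")


def split_utf8(hex_string: str) -> list[str]:
    """Split a hex dump of UTF-8 text into one hex group per character."""
    data = bytes.fromhex(hex_string)  # spaces between byte pairs are ignored
    groups = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must look like 10xxxxxx.
        if len(chunk) != n or any((b & 0b1100_0000) != 0b1000_0000 for b in chunk[1:]):
            raise ValueError(f"broken sequence at byte offset {i}")
        groups.append(chunk.hex(" ").upper())
        i += n
    return groups


print(split_utf8("20 C3 BF 20"))  # ['20', 'C3 BF', '20']
```

Reading left to right, the first byte of each character says how many bytes to take, so "20 C3" can never be a character: 20 announces a length of one.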

Hope this helps

[Edited at 2021-10-20 10:14 GMT]

[Edited at 2021-10-20 10:17 GMT]

[Edited at 2021-10-20 10:18 GMT]