The encode() method of the str class in Python

Overview:

  • Specific ways of using bit patterns (the bits from one or more bytes) to represent the numeric values that are mapped to characters like "A", "B" and "∫" are called encoding schemes.

  • ASCII and Unicode are examples of character mapping schemes, while ASCII, UTF-8 and UTF-16 are examples of encoding schemes. ASCII is both a mapping scheme and an encoding scheme.

  • The method encode() converts a string of Unicode code points to bytes as per a given encoding scheme. The default encoding scheme is UTF-8; a different scheme can be specified through the parameter encoding.

  • In Python, text is stored as a sequence of Unicode code points. CPython uses a flexible internal string representation (PEP 393) that stores each string with 1, 2 or 4 bytes per code point, depending on the largest code point the string contains. If the input is a stream of bytes produced by a specific encoding scheme, it can be converted back to a string using decode().
  • Understanding the method encode() requires a quick recap of the basics:
    • 1 bit can represent two values, 0 and 1.
    • 2 bits can represent four values.
    • 3 bits can represent eight values. An octal digit, which is of base 8, requires 3 bits for its representation.
    • 4 bits can represent sixteen values. A hexadecimal digit, which is of base 16, requires 4 bits.
    • 7 bits can represent 128 values and 8 bits can represent 256 values. ASCII is a 7-bit encoding scheme.
    • Unicode is a specification that maps the characters of all the world's languages, including special characters and emojis, to distinct numeric values called code points. These code points can be encoded to bytes using any of the schemes UTF-8, UTF-16 and UTF-32.
      • UTF-8 is a variable-length encoding. It uses 1 to 4 bytes per character based on the value of the code point. Each byte conveys whether it is a leading byte (and how many continuation bytes follow it) or a continuation byte. The size of the code unit is 8 bits here (i.e., one byte).
      • UTF-16 uses a single 16-bit code unit for code points below 65536, which includes the ASCII character set (characters whose code points are from 0 to 127). A pair of 16-bit code units (a surrogate pair) is used for characters whose code points are 65536 or above. The size of the code unit is 16 bits here (i.e., two bytes). The sketch after this list shows how the encoded byte length varies between these schemes.
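
The following sketch ties these points together: ord() gives the Unicode code point of a character, the number of bytes produced by encode() varies between UTF-8, UTF-16 and UTF-32, and decode() converts the bytes back into the original string. The characters and encoding names used here are only illustrative; the "-le" codec variants are chosen to avoid the byte order mark that the plain "utf-16" and "utf-32" codecs prepend.

# Sketch: code points, byte lengths under different encoding
# schemes, and the encode()/decode() round trip
for ch in ("A", "∫", "😀"):
    print("Character:", ch, "Code point:", ord(ch))
    for scheme in ("utf-8", "utf-16-le", "utf-32-le"):
        data = ch.encode(scheme)
        print(f"  {scheme:<9} -> {len(data)} byte(s): {data.hex('-')}")

# decode() converts the bytes back into the original string
original  = "こんにちは"
roundTrip = original.encode("utf-8").decode("utf-8")
print(roundTrip == original)   # True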

Example 1:

# Example Python program that encodes a string using UTF-8

# Encode the text using the default encoding scheme, UTF-8
quoted  = "Continuous as the stars that shine"
encoded = quoted.encode()

# Print the bytes object, its type and a hyphen-separated hex dump
# (bytes.hex() accepts a separator from Python 3.8 onwards)
print(encoded)
print(type(encoded))
print(encoded.hex("-"))

Output:

b'Continuous as the stars that shine'

<class 'bytes'>

43-6f-6e-74-69-6e-75-6f-75-73-20-61-73-20-74-68-65-20-73-74-61-72-73-20-74-68-61-74-20-73-68-69-6e-65
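
Since every character in the string is an ASCII character, UTF-8 encodes each one as a single byte; the 34-character string therefore produces 34 bytes, as the hex dump shows.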

Example 2 - Using an error handler to handle a UnicodeEncodeError raised during encoding:

# Example Python program that uses an error handler
# to handle encoding errors
import codecs

# Custom error handler that replaces unencodable parts 
# of the string with a '#'
def onError(error):
    # Return a tuple of the replacement text ('#') and the
    # position at which encoding should resume
    return ('#', error.start + 1)

# Characters in Japanese
helloText = "こんにちは123"

# Use the built-in 'replace' error handler, which substitutes a '?'
# for each character that cannot be encoded in ASCII
encoded = helloText.encode(encoding="ascii", errors="replace")
print(encoded)

# Register the custom error handler under the name "hashreplace"
codecs.register_error("hashreplace", onError)

# Use the custom error handler
encoded = helloText.encode(encoding="ascii", errors="hashreplace")
print(encoded)
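
Output (the built-in 'replace' handler substitutes a '?' for each unencodable character, while the custom handler substitutes a '#'):

b'?????123'

b'#####123'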

 


Copyright 2024 © pythontic.com