Punjabi - Shahmukhi Script

Shahmukhi Punjabi


Muslim influence on the Indian Subcontinent began around the 9th century. Shahmukhi derives its character set from the Persian and Arabic scripts. Its use to transcribe Punjabi commenced around the 10th and 11th centuries, after the Muslim conquests and the establishment of empires in the Indian Subcontinent. It is a right-to-left script used for Punjabi in Pakistan, and the shape assumed by a character in a word is context sensitive. In Unicode, Arabic and associated languages such as Punjabi and Urdu have been allocated 1,200 code points (0600h-06FFh and FB50h-FEFFh). Most Shahmukhi characters are already in Unicode, but a few are missing.
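
The figure of 1,200 code points can be checked directly from the two ranges quoted above. The following is a minimal sketch; the names `ARABIC_RANGES`, `range_size` and `in_quoted_ranges` are illustrative and not part of the PMT system. It counts positions in the ranges, not all of which are assigned to Arabic-script characters.

```python
# The two ranges quoted in the text, as inclusive (first, last) pairs.
ARABIC_RANGES = [(0x0600, 0x06FF), (0xFB50, 0xFEFF)]


def range_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


def in_quoted_ranges(ch):
    """True if the character's code point lies in one of the quoted ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ARABIC_RANGES)


total = sum(range_size(first, last) for first, last in ARABIC_RANGES)
print(total)                       # 256 + 944 = 1200
print(in_quoted_ranges("\u0628"))  # the letter Bey (ARABIC LETTER BEH): True
print(in_quoted_ranges("A"))       # Latin capital A: False
```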

Punjabi is the mother tongue of more than 110 million people: 66 million in Pakistan, 44 million in India, and many millions in America, Canada and Europe. For centuries it has been written in two mutually incomprehensible scripts, Shahmukhi and Gurmukhi. Punjabis from Pakistan cannot read Punjabi written in Gurmukhi, and Punjabis from India cannot read Punjabi written in Shahmukhi. In contrast, they have no difficulty understanding each other's speech. The Punjabi Machine Transliteration (PMT) system is an effort to bridge the written communication gap between the two scripts for the benefit of the millions of Punjabis around the globe.

Punjabi is the language of this vast area, and it has a rich literary past. The first known poet of Punjab, Baba Farid, lived in the 12th century, but most of its luminaries are from the past five hundred years. It is written in different scripts by different people, and many varieties now stake a claim to being independent languages, for example Potohari, Pahari, Dogri, Hindko and Saraiki. The matter, however, remains undecided.

Literary History of Punjabi

Pakistani Punjab's population was 73.62 million according to the 1998 census, and the projected population for 2005 is 84.85 million. Of these, 68.7 percent live in villages, with agriculture the dominant source of livelihood. The numbers of speakers of different languages in Pakistani Punjab are as follows.

Major Languages of Pakistan
Language | Percentage of Speakers | Number of Speakers
Source: Census 2001, Table 2.7. The population is assumed to be 150 million in 2003, as it was 132,352,000 in 1998 and the growth rate is 2.69%.
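
The 2003 assumption in the source note can be reproduced with a short calculation. This sketch assumes the 2.69% rate is compounded annually over the five years from 1998 to 2003, which the source does not state. The function name `project` is illustrative.

```python
POP_1998 = 132_352_000   # population in 1998, as quoted in the source note
GROWTH_RATE = 0.0269     # 2.69% per year, as quoted in the source note


def project(population, rate, years):
    """Compound the population forward by `years` at a fixed annual rate."""
    return population * (1 + rate) ** years


estimate_2003 = project(POP_1998, GROWTH_RATE, 2003 - 1998)
# Prints the estimate in millions; it comes out close to the 150 million assumed.
print(round(estimate_2003 / 1e6, 1))
```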

Punjab Census Report 2001





Features of Shahmukhi Script

The distinguishing characteristics of Shahmukhi are discussed here for the benefit of the unacquainted reader. Punjabi is greatly influenced by the Arabic and Persian languages. Shahmukhi derives its character set from Persian and Arabic; it is a superset of the Arabic and Persian alphabets and contains 43 basic characters and 15 diacritical marks. Figure 1 shows the Shahmukhi alphabet. Unlike English, its characters have no upper and lower case.

Code page

Figure 1: Character set of Shahmukhi

Further, the shape assumed by a character in a word is context sensitive, i.e. the shape differs depending on whether the character is at the beginning, in the middle or at the end of the word. This gives three shapes, the fourth being the independent shape of the character. Figure 2 gives these four shapes for the character named Bey.

Contextual Shapes

Figure 2: Context sensitive Shapes of Bey
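
Unicode records the four shapes of Bey as separate compatibility code points in the Arabic Presentation Forms-B block, alongside the base letter U+0628 (named BEH in Unicode). The sketch below uses Python's standard `unicodedata` module to list them and to confirm that each one normalizes back to the base letter. The dictionary `BEY_FORMS` is an illustrative name. In practice, text is stored as base letters and the rendering engine selects the shape.

```python
import unicodedata

BEY = "\u0628"  # base letter, ARABIC LETTER BEH

# Compatibility code points for the four contextual shapes of Bey.
BEY_FORMS = {
    "independent": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "middle": "\uFE92",
}

for position, form in BEY_FORMS.items():
    # NFKC normalization folds a presentation form back to its base letter.
    folds_to_base = unicodedata.normalize("NFKC", form) == BEY
    print(position, f"U+{ord(form):04X}", unicodedata.name(form), folds_to_base)
```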

To be precise, the above is true for all but eleven characters. Ten of these have only two shapes, the independent shape and the final shape; they are shown in Figure 3. Where another character would take its initial or middle shape, at the beginning or in the middle of a word, these characters take their independent or final shape respectively.

Characters having only two shapes

Figure 3: Characters having only Independent and Final Shape
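
The effect of two-shape characters on their neighbours can be sketched as a small shape-selection routine. This is a simplified illustration and not the rendering algorithm of any particular engine: it treats a letter as able to join forward only if Unicode lists an initial presentation form for it, and it does not model the special behaviour of Hamza. The helpers `contextual_forms` and `shape_word` are hypothetical names.

```python
import unicodedata

SHAPES = ("isolated", "final", "initial", "medial")


def contextual_forms(letter):
    """Return {shape name: presentation-form character} for one base letter."""
    forms = {}
    for cp in range(0xFB50, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0628": a shape tag and one base code point.
        if len(parts) == 2 and parts[0].strip("<>") in SHAPES:
            if int(parts[1], 16) == ord(letter):
                forms[parts[0].strip("<>")] = chr(cp)
    return forms


def shape_word(word):
    """Choose a shape name for every letter of a word from its neighbours."""
    forms = [contextual_forms(ch) for ch in word]
    shapes = []
    for i, own in enumerate(forms):
        # A letter joins backwards only if the previous letter can join forwards.
        joins_prev = i > 0 and "initial" in forms[i - 1] and "final" in own
        joins_next = i < len(word) - 1 and "initial" in own and "final" in forms[i + 1]
        if joins_prev and joins_next:
            shapes.append("medial")
        elif joins_prev:
            shapes.append("final")
        elif joins_next:
            shapes.append("initial")
        else:
            shapes.append("isolated")
    return shapes


# Bey + Alif + Bey: Alif has only two shapes, so the last Bey cannot join to it.
print(shape_word("\u0628\u0627\u0628"))
# Alif + Bey: a two-shape letter at the start of a word stays independent.
print(shape_word("\u0627\u0628"))
```

Here "isolated" and "medial" are Unicode's terms for what the text calls the independent and middle shapes.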

The eleventh is Hamza. It never comes at the beginning of a word, but it can come at the beginning of a ligature. It also takes the independent shape instead of the final shape when it comes at the end of a word. As a result, it has initial, middle and independent shapes, as illustrated in Figure 4.

Hamza shapes

Figure 4: Shapes of Hamza, (Circled, right to left) Independent, Initial and Middle shape

Punjabi is traditionally written in Nastaleeq, a script rich in calligraphic content. Owing to the complexities of rendering, the basic shapes identified above cannot render the language in an acceptable form in Nastaleeq. The characters of Punjabi also need diacritics to help in the proper pronunciation of a word. Diacritics appear above or below a character to define a vowel or emphasize a particular sound, and they are the basis of the vowel system in Shahmukhi. There are a number of diacritics, the common ones being Zabar, Zer and Pesh. Figure 5 shows the character Bey marked with these diacritics.


Figure 5: Bey with Diacritics
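
In Unicode text, each of these diacritics is a separate combining mark stored after the base letter. The sketch below builds Bey with Zabar, Zer and Pesh, using their Unicode names (FATHA, KASRA and DAMMA), and shows one possible way to remove such marks when comparing marked and unmarked text. The helper `strip_diacritics` is a hypothetical name, and dropping every nonspacing mark is a simplifying assumption that may remove more than the vowel signs.

```python
import unicodedata

BEY = "\u0628"  # ARABIC LETTER BEH

DIACRITICS = {
    "Zabar": "\u064E",  # ARABIC FATHA, written above the letter
    "Zer": "\u0650",    # ARABIC KASRA, written below the letter
    "Pesh": "\u064F",   # ARABIC DAMMA, written above the letter
}


def strip_diacritics(text):
    """Drop every nonspacing combining mark (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


for name, mark in DIACRITICS.items():
    marked = BEY + mark  # the mark follows the base letter in memory
    print(name, unicodedata.name(mark), len(marked), strip_diacritics(marked) == BEY)
```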

Figure 6 shows Punjabi text in Nastaleeq script with diacritics placed on the respective characters.

Shahmukhi Sentence

Figure 6: Punjabi (Shahmukhi) Text in Nastaleeq

Diacritics, though part of the language, are used sparingly. They are nevertheless essential for removing ambiguities, for natural language processing and for speech synthesis. Thus (a) the multiple shapes of characters, (b) the complexities of the traditional script of Punjabi and (c) the existence of diacritics are the major factors that have contributed to the difficulty of formulating a standard for Punjabi.

