Question: Surround emoji with spaces

Added at 2016-12-05 18:12

I used this post to make a regex that finds emojis in a string of text and simply puts some space characters on either side of them. My regex code:

try:
    # Wide UCS-4 build
    oRes = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+', 
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    oRes = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+', 
        re.UNICODE)

s2 = oRes.sub(r'  \1  ', s1)

However, I am getting some really odd behaviour where emojis are being removed, as in the example below. Any advice would be appreciated. I am using Python on a MacBook. Thanks.

INPUT

هيلاري كلينتون "متنحة" وتشير إلى عملية غش في ولاية بانسيلفانيا العتيقة قائلة: "عند فرز الاصوات ..قطعوا الكهربا 😂✋" #ابو_الياس

OUTPUT

هيلاري كلينتون "متنحة" وتشير إلى عملية غش في ولاية بانسيلفانيا العتيقة قائلة: "عند فرز الاصوات ..قطعوا الكهربا ✋ " #ابو_الياس

Answer #1, added 2016-12-05 21:12

The following works for me once I correct the placement of the round brackets in your regular expressions. In the try block, you need round brackets around the whole pattern if you want to create the group \1 at all. In the except block, the round brackets need to include the +; otherwise the group itself is repeated and \1 holds only the last of several consecutive matching characters, which would explain why 😂 disappears from your output while ✋ survives.

import re
with open('input.txt', 'rb') as f:
    s1 = f.read().decode('utf-8').strip()

try:
    # Wide UCS-4 build
    oRes = re.compile(u'(['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+)', 
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    oRes = re.compile(u'(('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+)', 
        re.UNICODE)

s2 = oRes.sub(r'  \1  ', s1)

with open('output.txt', 'wb') as f:
    f.write((s1+'\n').encode('utf-8'))
    f.write((s2+'\n').encode('utf-8'))
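To see why the bracket placement matters, here is a minimal sketch, assuming Python 3 (where there is no narrow/wide build distinction). The names CLASS, inner, outer and text are made up for illustration, and the character class is a trimmed-down version of the ranges above. When the + sits outside the group, the group is matched repeatedly and \1 ends up holding only its last iteration; when the + sits inside, \1 holds the whole run.

```python
import re

# Trimmed-down emoji character class (illustrative, not exhaustive).
CLASS = '[\U0001F300-\U0001F64F\u2600-\u27BF]'

# Group repeated by an outer +: \1 keeps only the last iteration.
inner = re.compile('(' + CLASS + ')+')
# Group wraps the repetition: \1 keeps the entire run of emoji.
outer = re.compile('(' + CLASS + '+)')

# U+1F602 (face with tears of joy) followed by U+270B (raised hand).
text = 'abc \U0001F602\u270B def'

print(repr(inner.sub(r' \1 ', text)))  # first emoji is dropped
print(repr(outer.sub(r' \1 ', text)))  # both emoji are kept
```

The first result loses the first emoji, which matches the symptom in the question; the second keeps both and pads the run with spaces.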

As for the reversal of your characters, that must be an artifact of some step in your input/output or copy/paste chain not correctly handling the right-to-left nature of Arabic. It doesn't happen for me. The results look good when I open output.txt in TextWrangler on my MacBook.
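One way to check whether characters are really stored in a different order, or only displayed differently by a viewer handling right-to-left text, is to list the code points in their stored order. This is a rough sketch assuming Python 3; the helper name describe and the sample string are hypothetical, not part of the code above.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point, in stored (logical) order."""
    return ['U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>'))
            for ch in text]

# Arabic letter alef, a space, then the two emoji from the example.
sample = '\u0627 \U0001F602\u270B'
for line in describe(sample):
    print(line)
```

Running this on both s1 and s2 and comparing the lists shows the actual character order regardless of how an editor or terminal renders the line.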
