JSON Encoding Issue

Short:

We get the following error message when sending valid JSON with a Character Encoding other than UTF-8:

{
"status": false,
"error": "Invalid or malformed JSON: control character error, possibly incorrectly encoded"
}

Long:

The JSON spec in RFC 4627 explicitly allows character encodings other than UTF-8. This is stated in section 3 (JSON Spec: Encoding), which says:


   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.


This is currently not the case with the Pipedrive API, which does not accept any encoding other than UTF-8. (To be honest, I only tested with “activities”, but I assume this applies to all endpoints.)

How to Reproduce

Here is a script to reproduce the issue. The “Super Secret” elements obviously have to be replaced…

<?php

// Pipedrive API token
$api_token = '<Super Secret>';
// Pipedrive company domain
$company_domain = '<Super Secret>';
$deal_id = <Super Secret>;

// Payload
$data = array(
    'subject' => 'Test Encoding',
    'type' => 'lunch',
    'deal_id' => $deal_id,
    "public_description" => "This is a test description",
    'note' => "This is a test note"
);


// URL for adding an Activity
$url = 'https://' . $company_domain . '.pipedrive.com/api/v1/activities?api_token=' . $api_token;

// Convert the whole JSON string to UTF-16BE (example; UTF-16LE or UTF-32BE/LE would do as well)
$payload = mb_convert_encoding(json_encode($data), 'UTF-16BE', 'UTF-8');

// Send the request with cURL
$ch = curl_init();


curl_setopt_array($ch, [
    CURLOPT_URL => $url,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_POSTFIELDS => $payload,
    CURLOPT_HTTPHEADER => [
        'Content-Type: application/json'
    ],
]);


$output = curl_exec($ch);
print_r(curl_getinfo($ch));
print_r($output);
curl_close($ch);

Question:

Is there any hope that this will be fixed (or rather, that other encodings will be supported) anytime soon?

Hello @Thomas_Lange

There are no plans to support other encodings at the moment, since UTF-8 is the de facto standard. But may I ask why you need an encoding other than UTF-8?

Hello @riin,

Thank you for your reply. It ain’t us: we use a third-party tool provider that happens to send encodings other than UTF-8.

As we use both Pipedrive and them, we are in quite a predicament here and will have to write a small API to adjust the encoding… I assume…

Thank you for your time,
Thomas

Hey Thomas.

I think the problem is that you’re trying to change the charset of an already JSON-encoded payload. So the payload becomes something like

string(268) "\000{\000"\000s\000u\000b\000j\000e\000c\000t\000"\000:\000"\000T\000e\000s\000t\000 \000E\000n\000c\000o\000d\000i\000n\000g\000"\000...

Notice the \000 byte before the opening {. I guess our backend doesn’t understand that.

I would suggest encoding the individual values instead

$data = array(
// notice change here..
    'subject' => mb_convert_encoding('Test Encoding', 'UTF-16BE'),
    'type' => 'lunch',
    'deal_id' => $deal_id,
    "public_description" => mb_convert_encoding("This is a test description", 'UTF-16BE'),
    'note' => mb_convert_encoding("This is a test note", 'UTF-16BE')
);

Then the payload will be valid

string(482) "{"subject":"\u0000T\u0000e\u0000s\u0000t\u0000 \u0000E\u0000n\u0000c\u0000o\u0000d\u0000i\u0000n\u0000g","type":"lunch","deal_id":31,"public_description":"\u0000T\u0000h\u0000i\u0000s\u0000 \u0000i\u0000s\u0000 \u0000a\u0000 \u0000t\u0000e\u0000s\u0000t\u0000 \u0000d\u0000e\u0000s\u0000c\u0000r\u0000i\u0000p\u0000t\u0000i\u0000o\u0000n","note":"\u0000T\u0000h\u0000i\u0000s\u0000 \u0000i\u0000s\u0000 \u0000a\u0000 \u0000t\u0000e\u0000s\u0000t\u0000 \u0000n\u0000o\u0000t\u0000e"}"
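
One thing that can be checked with this approach: each value is converted before json_encode runs, so the \u0000 escapes become part of the string values themselves and come back as NUL bytes when the payload is decoded. A small sketch to verify this, assuming the mbstring extension is available (the variable names are only illustrative):

```php
<?php
// Convert a single value to UTF-16BE before encoding, as in the array above.
$value = mb_convert_encoding('Test', 'UTF-16BE', 'UTF-8');
$json = json_encode(['subject' => $value]);

// Decode again and compare with the original text.
$decoded = json_decode($json, true);
$roundTrips = ($decoded['subject'] === 'Test');
$length = strlen($decoded['subject']);

var_dump($roundTrips); // bool(false)
var_dump($length);     // int(8): each character is preceded by a NUL byte
```

So the payload parses, but the stored text would probably not be the plain string "Test".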

Hey Artjom,

thx for your reply and for taking the time.

I know that the string itself, not its values, is UTF-16LE encoded.
And we do not have access to the point where the JSON is generated. It is done by a third-party provider and, what’s worse, it happens at random. I think it happens whenever a non-UTF-8 character is present in an HTML text field (I assume some people copy and paste text with leftover characters…). This causes the third-party provider to encode the whole payload in UTF-16LE (or BE, not sure).

So encoding the individual values is not an option. Very, very sadly…

But UTF-16/32 is valid for the whole JSON string, and that is what I wanted to ask about. The spec (see the link above) gives these starting bytes:

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8

As JSON will always start with two ASCII characters, it is possible to guess the encoding based on the first 4 bytes… So the payload in the example request will indeed look something like this: 00 xx 00 xx, which matches your description.
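
To make the table concrete, here is a minimal sketch of that detection in PHP. The function name detect_json_encoding is made up for illustration and is not part of PHP or the Pipedrive API. It assumes the payload has no byte order mark and starts with two ASCII characters, as RFC 4627 expects.

```php
<?php
// Hypothetical helper: guess the Unicode encoding of a JSON text from the
// pattern of null bytes in its first four octets (RFC 4627, section 3).
// Assumption: no byte order mark at the start of the text.
function detect_json_encoding(string $json): string
{
    if (strlen($json) < 4) {
        return 'UTF-8'; // too short to apply the table; fall back to the default
    }

    // $z[$i] is true where octet $i is 0x00
    $z = [];
    for ($i = 0; $i < 4; $i++) {
        $z[] = ($json[$i] === "\x00");
    }

    if ($z[0] && $z[1] && $z[2] && !$z[3]) {
        return 'UTF-32BE'; // 00 00 00 xx
    }
    if ($z[0] && !$z[1] && $z[2] && !$z[3]) {
        return 'UTF-16BE'; // 00 xx 00 xx
    }
    if (!$z[0] && $z[1] && $z[2] && $z[3]) {
        return 'UTF-32LE'; // xx 00 00 00
    }
    if (!$z[0] && $z[1] && !$z[2] && $z[3]) {
        return 'UTF-16LE'; // xx 00 xx 00
    }
    return 'UTF-8';        // xx xx xx xx (or anything else)
}

// Usage: the payload from the reproduction script starts with 00 7B 00 22.
$payload = mb_convert_encoding('{"subject":"Test Encoding"}', 'UTF-16BE', 'UTF-8');
echo detect_json_encoding($payload), "\n"; // UTF-16BE
```

The usage line needs the mbstring extension; the detection itself only looks at raw bytes.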

But I understand the decision to only support UTF-8, since, as @riin correctly stated, it is the de facto standard…
But I kinda hoped for UTF-16/32 support… and hope dies last :wink:
I think we will have to write our own webhook to re-encode the payload as UTF-8…
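
A conversion step like that could be sketched roughly as follows. This is only an illustration under assumptions: json_body_to_utf8 is a hypothetical name, the incoming body is assumed to have no byte order mark, and it is assumed to use one of the five encodings from the table above. It relies on the mbstring extension.

```php
<?php
// Hypothetical helper: re-encode an incoming JSON body as UTF-8 before
// forwarding it to an API that only accepts UTF-8.
function json_body_to_utf8(string $body): string
{
    // Build the null-byte pattern of the first four octets, e.g. "0x0x".
    $pattern = '';
    foreach (str_split(substr($body, 0, 4)) as $octet) {
        $pattern .= ($octet === "\x00") ? '0' : 'x';
    }

    // Patterns from RFC 4627, section 3; anything else is treated as UTF-8.
    $map = [
        '000x' => 'UTF-32BE',
        '0x0x' => 'UTF-16BE',
        'x000' => 'UTF-32LE',
        'x0x0' => 'UTF-16LE',
    ];
    $from = $map[$pattern] ?? 'UTF-8';

    $utf8 = ($from === 'UTF-8') ? $body : mb_convert_encoding($body, 'UTF-8', $from);

    // Reject anything that still does not parse as JSON after conversion.
    json_decode($utf8);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new InvalidArgumentException('Body is not valid JSON: ' . json_last_error_msg());
    }
    return $utf8;
}

// Usage: simulate a UTF-16LE body from the third-party tool.
$incoming = mb_convert_encoding('{"subject":"Test Encoding"}', 'UTF-16LE', 'UTF-8');
$payload = json_body_to_utf8($incoming); // can then be passed to CURLOPT_POSTFIELDS
```

Validating with json_decode after the conversion is a design choice in this sketch: a wrong guess about the encoding then fails loudly instead of being forwarded.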

Thx and a nice week to you,
Thomas