utf8mb4_unicode_ci Collation - issues

  • Posts: 461
  • Thank you received: 36
4 years 3 weeks ago #325268

-- HikaShop version -- : 4.4.0
-- Joomla version -- : 3.9.22
-- PHP version -- : 7.3.23
-- Browser(s) name and version -- : last Chrome

Hi guys,
- on a DB set as latin1_swedish_ci
- with different Tables collation: latin1_swedish_ci / utf8_general_ci / utf8mb4_general_ci / utf8mb4_unicode_ci

First - I change the entire DB collation through phpMyAdmin, following the " Changing the default collation for a database " guide, by selecting:
utf8mb4_unicode_ci + "Change all tables collations" + "Change all tables columns collations"

Second - Admin Tools > Change Database Collation (on the existing tables) > Choose a collation > utf8mb4_unicode_ci > Apply

15 of all the 356 Tables are not moved and are always as utf8_general_ci
As you can see 13 are from HikaShop.

1 - Won't this create any kind of issues ? Please, How to fix these tables ?
Or at the moment I just have to ignore them and leave them as they are after the transformation ?

Now, I read about it and, as far as I understand (in particular many thanks to Nicholas' help, Akeeba), it is a MySQL limitation, but also a "bad implementation" from you, the technical explanation I found - I have the DB, the Table and the Column Collation / character set format. Often it depends on:

MySQL has a limit of 767 bytes for each index. Many extensions use a VARCHAR(255) column which is also indexed without a character limit.
This essentially tells MySQL to create an index with the maximum amount of data the column can hold. Plain old UTF-8 is up to 3 bytes per character and the column is set to use 255 characters therefore the index size is 3x255 = 765 bytes which is less than the 767 byte limit.

If this column were to be converted to UTF8MB4 each character can take up to 4 bytes (that's what MB4 means: MutiByte, up to 4 bytes per character). Therefore the index size would need to be 4*255 = 1020 bytes. However this is bigger than MySQL's 767 byte limit for indices. Therefore the table cannot be converted to UTF8MB4.

There are two solutions BUT they must be implemented by the developer of the affected extension, you:
- A - to set the column to VARCHAR(191). The index in this case is 4*191 = 764 bytes which is under MySQL's limit. This is perfectly fine for most practical cases. Very few title fields need to go that big, even in languages which use a massive amount of characters per word (e.g. German and Greek)
- B - to keep the columns as VARCHAR(255) but to set a character limit to the index. This is the recommended solution for performance reasons but its applicability depends on the data shape. For example, a developer may choose to create an index on the column but limit it to 64 characters. This will only be a problem if you expect to have multiple rows with their first 64 characters in common. In fact, often, since it is very difficult to determine when creating a mass distributed extension which can be used in contexts, the developer can't even imagine and he prefers to convert the column to VARCHAR(191).

Now, if this is the case, Joomla added UTF8MB4 support nearly three years ago. The real reason for UTF8MB4 support in Joomla is security, because it mitigates a lot of trickery which could lead in SQL injections. Joomla 4 standard is (the old) utf8mb4_unicode_ci.
You should already have addressed it and I'm sure you are on it. It seems really a simple implementation and I'm really surprised that here we are still speaking about "utf8_general_ci", Do you agree ?

2 - Please, What is your plan on it ?

Last edit: 4 years 3 weeks ago by joomleb.

Please Log in or Create an account to join the conversation.

  • Posts: 82863
  • Thank you received: 13372
  • MODERATOR
4 years 3 weeks ago #325333

Hi,

This was discussed around 2 years ago:
www.hikashop.com/support/forum/product-c...704-emoji-icons.html

First, it is not a "bad implementation" from us. utf8_general_ci is the most widely supported encoding across all the versions of MySQL. Yes, it's old, that's why it's the most widely supported.
Any drawback of using it ? A bit for special emoticon characters. It's clearly not a game-changer to migrate to utf8mb4.
Would it be nice ? Yes, for sure.
We actually tried to change to utf8mb4 after this thread.
Unfortunately, we found issues in some rare cases, which lead us to revert that change.

But I don't see why you say "they must be implemented by the developer of the affected extension" ?
Changing the length of the columns or the length of the index in your tables in your database can be done on your end via your PHPMyAdmin. And if you do, HikaShop won't have a problem with shorter columns / indexes.
So you should be able to do that for your website, without having us do anything.

Now unless you want to enter special emojis in the description of your products, or you want a clean database where all the tables have the same collation, it won't change much for you to do that.
I would personally leave it "as is" and not bother.
Having latin1_swedish_ci is a much more important issue as you just can't write the characters of a lot of languages with it. But even that, as long as it's doesn't concern the languages of your website and the tables where the customers can enter text have utf8, you could ignore.

Please Log in or Create an account to join the conversation.

  • Posts: 461
  • Thank you received: 36
4 years 3 weeks ago #325344

Hi Nicolas,
many thanks for your answer. Before to going on,
in the link you gave me Jerome wrote: "...We won't change the encoding during updates but new installation would have the utf8mb4 encoding..."

A - So, HikaShop new installations are using utf8mb4_general_ci OR utf8mb4_unicode_ci collation ?

"...Changing the length of the columns or the length of the index in your tables in your database can be done on your end via your PHPMyAdmin. And if you do, HikaShop won't have a problem with shorter columns / indexes. So you should be able to do that for your website, without having us do anything..."

B - I should manually do that on the 13 tables mentioned above. Am I right ?
C - Do you mean by manually changing all the VARCHAR(255) to VARCHAR(191) type columns within the 13 tables ? Or are there other columns to change within the 13 tables?
D - The best thing would be to do the manual change between the first and second step, Am I right?

Please Log in or Create an account to join the conversation.

  • Posts: 82863
  • Thank you received: 13372
  • MODERATOR
4 years 2 weeks ago #325352

Hi,

A. No. As I said, we had to revert that a few weeks after adding that as some users had issues with that change.

B. Yes.

C. You don't necessarily need to change the length of the columns. You could change the length used by the indexes in these tables.
That way, you won't have to truncate any data you might have longer than 191 characters.
subscription.packtpub.com/book/web_devel...c08/index-management

D. "manual change between the first and second step". I'm sorry but I don't understand what you mean by "first and second step". So I'm not sure what you're asking here.

Please Log in or Create an account to join the conversation.

  • Posts: 461
  • Thank you received: 36
4 years 2 weeks ago #325387

Hi Nicolas,
many thanks for help:

C - I'm going to test and let you know here soon...

D - Please, about "first" and "second" steps, let see my first topic here where I describe them...

E - I want to share with you a constructive discussion I'm having with Nicholas (Akeeba) supporting Admin Tools where we are speaking also about the security reasons, you can read for complete at this link , anyway an important piece:

"...Now, I have two comments to make and please link these third party developers to this reply of mine.

First, I'd like to say that everything I have said here and everything I am about to say is already explained in minute detail in my PR which added backwards compatible UTF8MB4 support to Joomla 3.5.0 , submitted 5 ½ years ago on June 2015.

Regarding the practical considerations of UTF8MB4 vs UTF8 support in each extension, it's about which characters can be stored with each collation. When using a utf8_*_ci collation you are essentially only supporting the Unicode Basic Multilingual Plane (BMP) . This means that four byte Unicode characters such as Emoji and extended Chinese, Japanese and Korean characters are not supported and instead converted to question marks at best or result in truncated data at worst. Joomla has supported UTF8MB4 for more than five years and people expect these characters to work correctly with it and all of its extensions. There is no good reason why I shouldn't be able to send a mailing that contains an Emoji, make a product description that includes some extended Korean character or have automatic translation fail when using Unicode aliases in Global Configuration for an article whose title uses some less often used traditional Chinese character.

As to how they can convert their extensions, all they have to do is stop supporting Joomla 3.4 and earlier. I would say that's a reasonable thing to do considering that these versions of Joomla are less than 0.4% of all installed Joomla sites. Joomla 3.5 and later allows you to simply write your SQL queries using the utf8mb4_unicode_ci collation and it will magically and automatically convert that to utf8_unicode_ci collation if the database server and/or PHP MySQL driver used on the site do not support utf8mb4. From a third party developer's perspective you DO NOT have to worry about whether the database server supports utf8mb4 or not. That's Joomla's problem. In fact, Joomla's own SQL code does the same – it uses utf8mb4 . It still works on old versions of MySQL and PHP which don't support UTF8MB4. This alone should have gotten these developers thinking instead of waving off utf8mb4 support as infeasible and prompt them to take a look at Joomla's code where they would have easily spotted the convertUtf8mb4QueryToUtf8() method in the core Joomla database driver code.

So, it's absolutely feasible, it is necessary and there are no more excuses not to do it. If they really want to support Joomla 3.0 to 3.4 (less than 0.4% of Joomla installations) they can of course create two separate packages the only difference between them being the SQL files. This can be trivially automated, so it's not even a valid excuse. If they still refuse to upgrade to utf8mb4 then the problem is with them and only them...."

Last edit: 4 years 2 weeks ago by joomleb.

Please Log in or Create an account to join the conversation.

  • Posts: 82863
  • Thank you received: 13372
  • MODERATOR
4 years 2 weeks ago #325389

Hi,

D. So yes, exactly. In between both steps.

E. Thank you for the feedback.

Please Log in or Create an account to join the conversation.

  • Posts: 461
  • Thank you received: 36
4 years 2 weeks ago #325491

Hi Nicolas,

A - So, right now HikaShop / HikaMarket / HikaSubscription are working with utf8_general_ci collation on all their tables, Right ?

Please Log in or Create an account to join the conversation.

  • Posts: 4747
  • Thank you received: 644
  • MODERATOR
4 years 2 weeks ago #325560

Hello,

Exactly, that's correct.
Regards

Please Log in or Create an account to join the conversation.

  • Posts: 461
  • Thank you received: 36
4 years 1 week ago #325649

Hi Nicolas,
C - I read and made some tests. Considering that the tables site are quite empty. As far as I understand:
- by deleting the mentioned tables VARCHAR255 indexes
- to run first and second steps (as wrote above)
- recreating the deleted indexes

the limit issue during utf8 to utf8mb4 moving is avoided, Am I right?
And I don't have to add any characters limit (765) on those indexes, Am I right ?
OR, Should I have to add a 765 limit when recreating the mentioned indexes ?

Please Log in or Create an account to join the conversation.

  • Posts: 461
  • Thank you received: 36
4 years 1 week ago #325855

Hi Nicolas, Hi Philip,
I completed all tests and investigations. I politely want to share with you these info that can save your time (for a utf8 > utf8mb4 moving) as a thank you for your availability:

From MySQL 5.7 (MySQL 5.7.7 I think) the prefix limitation should be 3072 and not 1000 (or 767). On MariaDB it should be from 10.2 release - stackoverflow.com/a/1814594

So, during a utf8 > utf8mb4 moving no prefix index limits are needed and it works like a charm.

That is not totally true because of a MyISAM related bug that should have been fixed with MariaDB 10.4.3:
- jira.mariadb.org/browse/MDEV-18061
- jira.mariadb.org/browse/MDEV-371

So, if MariaDB < 10.4.3, whem MyISAM, we have to add the limit, while in case of InnoDB it is not necessary

Have a good weekend!

Last edit: 4 years 1 week ago by joomleb.

Please Log in or Create an account to join the conversation.

Time to create page: 0.067 seconds
Powered by Kunena Forum