implements charset or encoding detection
Public Member Functions
MCharsetDetector ()
MCharsetDetector (const QByteArray &ba)
MCharsetDetector (const char *str)
MCharsetDetector (const char *data, int size)
virtual ~MCharsetDetector ()
bool hasError () const
void clearError ()
QString errorString () const
void setText (const QByteArray &ba)
MCharsetMatch detect ()
QList< MCharsetMatch > detectAll ()
QString text (const MCharsetMatch &charsetMatch)
void setDeclaredLocale (const QString &locale)
void setDeclaredEncoding (const QString &encoding)
QStringList getAllDetectableCharsets ()
void enableInputFilter (const bool enable)
bool isInputFilterEnabled ()
MCharsetDetector provides a facility for detecting the charset or encoding of unknown character data. The input data comes from an array of bytes (a QByteArray). To some extent it also detects the language.
Character set detection involves some statistics and guesswork, so the results cannot be guaranteed to always be correct.
For best accuracy, the input data should be primarily in a single language, and a minimum of a few hundred bytes of plain text in that language is needed.
Optionally, HTML or XML style markup that could otherwise obscure the content can be ignored.
Example:
// create two sample encoded strings
// (the sample strings here are a bit short, if possible one
// should use longer input for detection, this is just
// to show how it works. But the short input in this example
// does actually work):
QTextCodec *codecUtf8 = QTextCodec::codecForName("UTF-8");
QTextCodec *codecEucjp = QTextCodec::codecForName("EUC-JP");
QByteArray encodedStringUtf8 = codecUtf8->fromUnicode(QString::fromUtf8("日本語"));
QByteArray encodedStringEucjp = codecEucjp->fromUnicode(QString::fromUtf8("日本語"));

// now try to detect the encoding of the sample strings:
MCharsetDetector charsetDetector(encodedStringUtf8);
if(charsetDetector.hasError())
    qWarning() << "an error happened" << charsetDetector.errorString();
MCharsetMatch bestMatch = charsetDetector.detect();
if(charsetDetector.hasError())
    qWarning() << "an error happened" << charsetDetector.errorString();

// print match found:
qDebug() << bestMatch.name();       // this will print "UTF-8"
qDebug() << bestMatch.language();   // this will print ""
qDebug() << bestMatch.confidence(); // this will print "100"

// decode input string into a QString, using the encoding
// detected as the best match:
QString result = charsetDetector.text(bestMatch);

// try another detection using the same MCharsetDetector object
// (it saves some time not creating a new one all the time):
charsetDetector.setText(encodedStringEucjp);
if(charsetDetector.hasError())
    qWarning() << "an error happened" << charsetDetector.errorString();
QList<MCharsetMatch> mCharsetMatchList = charsetDetector.detectAll();
if(charsetDetector.hasError())
    qWarning() << "an error happened" << charsetDetector.errorString();

// print all matches found:
for(int i = 0; i < mCharsetMatchList.size(); ++i) {
    qDebug() << i << ":"
             << mCharsetMatchList[i].name()      // for i==0, "EUC-JP" is printed
             << mCharsetMatchList[i].language()  // for i==0, "ja" is printed
             << mCharsetMatchList[i].confidence();
}

// decode input string into a QString using the encoding detected
// in the first (i.e. best) match (if there was a match at all):
if(!mCharsetMatchList.isEmpty())
    result = charsetDetector.text(mCharsetMatchList.first());
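The "statistics and guesswork" mentioned in the description can be illustrated with a toy scorer for a single charset. The sketch below is plain standard C++ and is not MCharsetDetector or libicu code; the function name utf8Confidence and all confidence values are assumptions chosen for illustration. It counts well-formed and malformed multi-byte UTF-8 sequences (without checking for overlong forms) and maps the counts to a rough 0-100 confidence, which also shows why longer input tends to give more reliable results.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of MCharsetDetector): returns a rough
// confidence (0-100) that `bytes` is UTF-8 encoded text.
int utf8Confidence(const std::string &bytes)
{
    int valid = 0;    // well-formed multi-byte sequences seen
    int invalid = 0;  // malformed sequences seen

    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        int trail = 0;  // continuation bytes expected after the lead byte

        if (lead < 0x80) { ++i; continue; }  // plain ASCII carries no evidence
        if ((lead & 0xE0) == 0xC0) trail = 1;
        else if ((lead & 0xF0) == 0xE0) trail = 2;
        else if ((lead & 0xF8) == 0xF0) trail = 3;
        else { ++invalid; ++i; continue; }   // not a legal lead byte

        ++i;
        int seen = 0;
        while (seen < trail && i < bytes.size()
               && (static_cast<unsigned char>(bytes[i]) & 0xC0) == 0x80) {
            ++seen;
            ++i;
        }
        if (seen == trail) ++valid; else ++invalid;
    }

    // Assumed scoring: arbitrary thresholds chosen for this illustration.
    if (invalid == 0 && valid >= 3) return 100;
    if (invalid == 0 && valid > 0)  return 80;
    if (invalid == 0)               return 10;  // pure ASCII: consistent, no evidence
    if (valid > invalid * 10)       return 25;
    return 0;
}
```

With these assumed thresholds, the nine bytes of "日本語" encoded as UTF-8 score 100, while the six EUC-JP bytes for the same text score 0.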
ML10N::MCharsetDetector::MCharsetDetector | ( | ) |
constructs an MCharsetDetector without text content
ML10N::MCharsetDetector::MCharsetDetector | ( | const QByteArray & | ba | ) |
constructs an MCharsetDetector with initial text content
ba | an array of bytes with the initial input text |
ML10N::MCharsetDetector::MCharsetDetector | ( | const char * | str | ) | [explicit] |
constructs an MCharsetDetector with initial text content
str | a string with the initial input text |
ML10N::MCharsetDetector::MCharsetDetector | ( | const char * | data, int | size | ) |
constructs an MCharsetDetector with initial text content
data | a string with the initial input text |
size | the length of data in bytes |
ML10N::MCharsetDetector::~MCharsetDetector | ( | ) | [virtual] |
destructor for MCharsetDetector
void ML10N::MCharsetDetector::clearError | ( | ) |
clears any error which might have occurred during the previous action
Useful if one wants to ignore an error and continue.
MCharsetMatch ML10N::MCharsetDetector::detect | ( | ) |
detects the most likely encoding
returns the MCharsetMatch with the highest confidence, i.e. the first element of the list returned by detectAll().
QList< MCharsetMatch > ML10N::MCharsetDetector::detectAll | ( | ) |
detects a list of possible encodings
returns a list of MCharsetMatch objects with charsets which appear to be consistent with the input. The results are ordered by confidence level, with the highest-confidence match first in the list. I.e. the first element in this list is always the same as the return value of detect().
The detection only looks at a limited amount of the input byte data, so the charsets detected might fail to handle all of the input data. This is checked again after the detection, and charsets which really do fail to handle all of the input data are removed from the list of detected encodings.
I.e. the charsets in the list returned by detectAll() will always handle all of the input data.
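The ordering and post-filtering described above can be sketched in standard C++. This is not the MCharsetDetector implementation: the Candidate struct, the rankAndFilter function, and the canDecodeAll callback are hypothetical names, and the callback stands in for whatever check the library performs to confirm that a charset handles all of the input.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for MCharsetMatch: a charset name plus a confidence.
struct Candidate {
    std::string name;
    int confidence;  // 0-100, higher is better
};

// Hypothetical callback type: true if `charset` can decode all of `bytes`.
using DecodeCheck =
    std::function<bool(const std::string &charset, const std::string &bytes)>;

// Drops candidates that cannot handle the whole input, then orders the rest
// so that the highest confidence comes first.
std::vector<Candidate> rankAndFilter(std::vector<Candidate> candidates,
                                     const std::string &input,
                                     const DecodeCheck &canDecodeAll)
{
    candidates.erase(
        std::remove_if(candidates.begin(), candidates.end(),
                       [&](const Candidate &c) {
                           return !canDecodeAll(c.name, input);
                       }),
        candidates.end());

    // stable_sort keeps the relative order of equal-confidence entries.
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate &a, const Candidate &b) {
                         return a.confidence > b.confidence;
                     });
    return candidates;
}

// Example check for one charset: only 7-bit input is valid US-ASCII.
bool isAscii(const std::string &bytes)
{
    return std::all_of(bytes.begin(), bytes.end(), [](char ch) {
        return static_cast<unsigned char>(ch) < 0x80;
    });
}
```

In this sketch the first element of the returned vector plays the role of the detect() result, and an empty vector corresponds to "no match at all".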
void ML10N::MCharsetDetector::enableInputFilter | ( | const bool | enable | ) |
enable or disable filtering of input text.
enable | whether to enable or disable the filtering |
If filtering is enabled, text within angled brackets ("<" and ">") will be removed before detection, which will remove most HTML or XML markup.
This does not use a real HTML or XML parser; it only looks for angled brackets. If fewer than 5 "<" characters are found, or a significant number of illegally nested "<" characters are found, or the input text looks like it is essentially nothing but markup, the filtering is abandoned and detection is done with the unstripped input.
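The filtering rules above can be sketched as a small standalone function. This is an illustration in standard C++, not the library's code: the name stripMarkupForDetection is hypothetical, and the two thresholds marked "assumed" are guesses, since the text only states the limit of 5 "<" characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of the angle-bracket filter. Returns the input with
// "<...>" spans removed, or the unmodified input when the heuristics decide
// that stripping should be abandoned.
std::string stripMarkupForDetection(const std::string &input)
{
    std::string stripped;
    std::size_t openTags = 0;  // total number of '<' characters
    std::size_t nested = 0;    // '<' found while a tag was already open
    bool inTag = false;

    for (char ch : input) {
        if (ch == '<') {
            if (inTag) ++nested;  // illegally nested '<'
            inTag = true;
            ++openTags;
            continue;
        }
        if (ch == '>' && inTag) {
            inTag = false;
            continue;
        }
        if (!inTag) stripped += ch;  // keep text outside of tags
    }

    const bool tooFewTags = openTags < 5;              // stated in the text above
    const bool tooManyNested = nested * 5 > openTags;  // assumed threshold
    const bool mostlyMarkup =
        stripped.size() * 10 < input.size();           // assumed threshold

    if (tooFewTags || tooManyNested || mostlyMarkup)
        return input;  // abandon filtering, detect on the unstripped input
    return stripped;
}
```

Because only brackets are inspected, attribute values, comments, and scripts inside a tag are dropped along with the tag itself, while a stray ">" outside any tag is kept as text.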
QString ML10N::MCharsetDetector::errorString | ( | ) | const |
text describing the error that occurred during the last action
Example:
MCharsetDetector detector;
if(detector.hasError())
    qWarning() << detector.errorString();
QStringList ML10N::MCharsetDetector::getAllDetectableCharsets | ( | ) |
returns a list of detectable charsets
This does not depend on the state of the MCharsetDetector object; it always returns the same list.
bool ML10N::MCharsetDetector::hasError | ( | ) | const |
checks whether an error occurred during the last action
bool ML10N::MCharsetDetector::isInputFilterEnabled | ( | ) |
test whether input filtering is enabled or not
void ML10N::MCharsetDetector::setDeclaredEncoding | ( | const QString & | encoding | ) |
set the declared encoding for charset detection.
encoding | a charset name to give as a hint |
This can be used as an additional hint for the charset detector.
This is passed through to the libicu encoding detection system, which currently does not do anything with that information. But MCharsetDetector uses it to improve the matches returned by the libicu encoding detection.
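The text does not say how MCharsetDetector uses the declared encoding, so the following is only one conceivable policy, sketched in standard C++: add a fixed bonus to the candidate whose name matches the hint and re-sort. The Candidate struct, both function names, and the bonus value are assumptions made for this illustration and are not taken from the library.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for MCharsetMatch.
struct Candidate {
    std::string name;
    int confidence;  // 0-100
};

// Case-insensitive comparison of charset names ("utf-8" vs "UTF-8").
bool sameCharsetName(const std::string &a, const std::string &b)
{
    return a.size() == b.size()
        && std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
               return std::tolower(static_cast<unsigned char>(x))
                   == std::tolower(static_cast<unsigned char>(y));
           });
}

// Assumed policy: raise the confidence of the candidate that matches the
// declared encoding by `bonus` (capped at 100), then re-sort by confidence.
std::vector<Candidate> applyDeclaredEncoding(std::vector<Candidate> candidates,
                                             const std::string &declared,
                                             int bonus = 20)
{
    for (Candidate &c : candidates) {
        if (sameCharsetName(c.name, declared))
            c.confidence = std::min(100, c.confidence + bonus);
    }
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate &a, const Candidate &b) {
                         return a.confidence > b.confidence;
                     });
    return candidates;
}
```

A bonus of this kind only changes the ranking among charsets that were already detected; it never introduces a charset that the detection itself did not propose.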
void ML10N::MCharsetDetector::setDeclaredLocale | ( | const QString & | locale | ) |
set the declared locale for charset detection
locale | a locale name to give as a hint |
This can be used as an additional hint for the charset detector.
For example, setDeclaredLocale("zh") may help to detect Chinese legacy encodings better, setDeclaredLocale("zh_CN") may further improve this for simplified Chinese legacy encodings, setDeclaredLocale("ru") may help to detect Russian legacy encodings better, and so on.
void ML10N::MCharsetDetector::setText | ( | const QByteArray & | ba | ) |
sets the input byte data whose charset is to be detected.
QString ML10N::MCharsetDetector::text | ( | const MCharsetMatch & | charsetMatch | ) |
returns the entire input text decoded into a QString, using the encoding of the given match
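What "decoding with the encoding of a match" means can be shown for one simple charset. The sketch below is standard C++ and does not use Qt or MCharsetDetector; latin1ToUtf8 is a hypothetical helper that interprets the input bytes as ISO-8859-1 and re-encodes the resulting characters as UTF-8. The same bytes decoded with a different charset would yield different text, which is why the choice of match matters.

```cpp
#include <string>

// Hypothetical helper: interprets `bytes` as ISO-8859-1 and returns the same
// characters encoded as UTF-8.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    out.reserve(bytes.size());
    for (char ch : bytes) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            // ASCII range is identical in both encodings.
            out += static_cast<char>(b);
        } else {
            // U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```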
Copyright © 2010 Nokia Corporation | MeeGo Touch |