
MCharsetDetector Class Reference

implements charset or encoding detection

Public Member Functions

 MCharsetDetector ()
 MCharsetDetector (const QByteArray &ba)
 MCharsetDetector (const char *str)
 MCharsetDetector (const char *data, int size)
virtual ~MCharsetDetector ()
bool hasError () const
void clearError ()
QString errorString () const
void setText (const QByteArray &ba)
MCharsetMatch detect ()
QList< MCharsetMatch > detectAll ()
QString text (const MCharsetMatch &charsetMatch)
void setDeclaredLocale (const QString &locale)
void setDeclaredEncoding (const QString &encoding)
QStringList getAllDetectableCharsets ()
void enableInputFilter (const bool enable)
bool isInputFilterEnabled ()

Detailed Description

implements charset or encoding detection

MCharsetDetector provides a facility for detecting the charset or encoding of unknown character data. The input data comes from an array of bytes (a QByteArray). To some extent, it also detects the language.

Character set detection involves statistics and guesswork, so the results cannot be guaranteed to be correct.

For best accuracy, the input data should be primarily in a single language, and at least a few hundred bytes of plain text in that language are needed.

Optionally, HTML- or XML-style markup that could otherwise obscure the content can be ignored.

Example:

 // create two sample encoded strings
 // (the sample strings here are a bit short, if possible one
 // should use longer input for detection, this is just
 // to show how it works. But the short input in this example
 // does actually work):
 QTextCodec *codecUtf8 = QTextCodec::codecForName("UTF-8");
 QTextCodec *codecEucjp = QTextCodec::codecForName("EUC-JP");
 QByteArray encodedStringUtf8
     = codecUtf8->fromUnicode(QString::fromUtf8("日本語"));
 QByteArray encodedStringEucjp
     = codecEucjp->fromUnicode(QString::fromUtf8("日本語"));

 // now try to detect the encoding of the sample strings:
 MCharsetDetector charsetDetector(encodedStringUtf8);
 if(charsetDetector.hasError())
     qWarning() << "an error happened" << charsetDetector.errorString();
 MCharsetMatch bestMatch = charsetDetector.detect();
 if(charsetDetector.hasError())
     qWarning() << "an error happened" << charsetDetector.errorString();
 // print match found:
 qDebug() << bestMatch.name();       // this will print "UTF-8"
 qDebug() << bestMatch.language();   // this will print ""
 qDebug() << bestMatch.confidence(); // this will print "100"
 // decode input string into a QString, using the encoding
 // detected as the best match:
 QString result = charsetDetector.text(bestMatch);

 // try another detection using the same MCharsetDetector object
 // (it saves some time not creating a new one all the time):
 charsetDetector.setText(encodedStringEucjp);
 if(charsetDetector.hasError())
     qWarning() << "an error happened" << charsetDetector.errorString();
 QList<MCharsetMatch> mCharsetMatchList = charsetDetector.detectAll();
 if(charsetDetector.hasError())
     qWarning() << "an error happened" << charsetDetector.errorString();
 // print all matches found:
 for(int i = 0; i < mCharsetMatchList.size(); ++i) {
     qDebug() << i << ":"
              << mCharsetMatchList[i].name()     // for i==0, "EUC-JP" is printed
              << mCharsetMatchList[i].language() // for i==0, "ja" is printed
              << mCharsetMatchList[i].confidence();
 }
 // decode input string into a QString using the encoding detected
 // in the first (i.e. best) match (if there was a match at all):
 if(!mCharsetMatchList.isEmpty())
     result = charsetDetector.text(mCharsetMatchList.first());

Constructor & Destructor Documentation

MCharsetDetector::MCharsetDetector (  ) 
MCharsetDetector::MCharsetDetector ( const QByteArray &  ba  ) 

constructs an MCharsetDetector with an initial text content

Parameters:
ba an array of bytes with the initial input text
See also:
MCharsetDetector()
MCharsetDetector(const char *str)
MCharsetDetector(const char *data, int size)
setText(const QByteArray &ba)
MCharsetDetector::MCharsetDetector ( const char *  str  )  [explicit]

constructs an MCharsetDetector with an initial text content

Parameters:
str a string with the initial input text
See also:
MCharsetDetector()
MCharsetDetector(const QByteArray &ba)
MCharsetDetector(const char *data, int size)
setText(const QByteArray &ba)
MCharsetDetector::MCharsetDetector ( const char *  data,
int  size 
)

constructs an MCharsetDetector with an initial text content

Parameters:
data a pointer to the bytes of the initial input text
size the number of bytes in data
See also:
MCharsetDetector()
MCharsetDetector(const QByteArray &ba)
MCharsetDetector(const char *str)
setText(const QByteArray &ba)
MCharsetDetector::~MCharsetDetector (  )  [virtual]

destructor for MCharsetDetector


Member Function Documentation

void MCharsetDetector::clearError (  ) 

clears any error which might have occurred during previous action

Useful if one wants to ignore an error and continue.

See also:
hasError()
errorString()
MCharsetMatch MCharsetDetector::detect (  ) 

detects the most likely encoding

returns an MCharsetMatch object.

See also:
detectAll()
QList< MCharsetMatch > MCharsetDetector::detectAll (  ) 

detects a list of possible encodings

returns a list of MCharsetMatch objects with charsets which appear to be consistent with the input. The results are ordered by confidence level; the match with the highest confidence comes first in the list. I.e. the first element in this list is always the same as the return value of detect().

The detection only looks at a limited amount of the input byte data, so a detected charset might fail to handle all of the input data. This is checked again after the detection, and charsets which really do fail to handle all of the input data are removed from the list of detected encodings.

I.e. the charsets in the list returned by detectAll() will always handle all of the input data.

See also:
detect()
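
The ordering and the "remove charsets that cannot handle all of the input" step described for detectAll() can be illustrated with a small standard-library sketch. This is not the MCharsetDetector implementation: Candidate, canDecodeAll(), and rankAndValidate() are hypothetical names invented for this illustration, and only three charsets are checked.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for MCharsetMatch, reduced to two fields.
struct Candidate {
    std::string name;
    int confidence; // 0..100, higher is better
};

// Structural UTF-8 check: lead bytes and continuation bytes must line up.
// (Overlong forms and surrogates are not rejected in this sketch.)
bool isValidUtf8(const std::string &data)
{
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;           // stray continuation or invalid lead byte
        if (i + extra >= data.size())
            return false;            // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(data[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;
        }
        i += extra + 1;
    }
    return true;
}

// Assumption for this sketch: only these three charsets can be verified;
// anything else is treated as unable to handle the input.
bool canDecodeAll(const std::string &charset, const std::string &input)
{
    if (charset == "UTF-8")
        return isValidUtf8(input);
    if (charset == "US-ASCII")
        return std::all_of(input.begin(), input.end(), [](char ch) {
            return static_cast<unsigned char>(ch) < 0x80;
        });
    if (charset == "ISO-8859-1")
        return true; // every byte value maps to a character
    return false;
}

// Drops candidates that cannot decode the whole input, then orders the
// rest so that the highest confidence comes first.
std::vector<Candidate> rankAndValidate(std::vector<Candidate> candidates,
                                       const std::string &input)
{
    candidates.erase(
        std::remove_if(candidates.begin(), candidates.end(),
                       [&input](const Candidate &c) {
                           return !canDecodeAll(c.name, input);
                       }),
        candidates.end());
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate &a, const Candidate &b) {
                         return a.confidence > b.confidence;
                     });
    return candidates;
}
```

With candidates for UTF-8, ISO-8859-1, and US-ASCII and an input containing the single byte 0xE9, only the ISO-8859-1 candidate survives in this sketch, because the byte is neither ASCII nor a complete UTF-8 sequence.
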
void MCharsetDetector::enableInputFilter ( const bool  enable  ) 

enable or disable filtering of input text.

Parameters:
enable selects whether to enable or disable the filtering

If filtering is enabled, text within angled brackets (“<” and “>”) will be removed before detection, which will remove most HTML or XML markup.

This does not use a real HTML or XML parser; it only looks for angled brackets. If fewer than 5 “<” characters are found, or a significant number of illegally nested “<” characters are found, or the input text looks like it is essentially nothing but markup, the filtering is abandoned and detection is done with the unstripped input.

See also:
isInputFilterEnabled()
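
The filtering rules described for enableInputFilter() can be sketched with the standard library alone. This is only an illustration, not the code used by MCharsetDetector or libicu: FilterResult and stripMarkup() are hypothetical names, and the thresholds for "a significant number" of badly nested brackets and for "essentially nothing but markup" are assumptions, since the text above fixes only the "fewer than 5" rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct FilterResult {
    std::string text; // stripped text, or the original input on fallback
    bool filtered;    // true if the stripped text was accepted
};

FilterResult stripMarkup(const std::string &input)
{
    std::string stripped;
    std::size_t openCount = 0; // number of '<' characters seen
    std::size_t badCount = 0;  // '<' seen while already inside a tag
    bool inTag = false;

    for (char c : input) {
        if (c == '<') {
            if (inTag)
                ++badCount;    // illegally nested '<'
            inTag = true;
            ++openCount;
        }
        if (!inTag)
            stripped.push_back(c); // keep only text outside of tags
        if (c == '>')
            inTag = false;
    }

    // Documented rule: fewer than 5 '<' characters means no filtering.
    const bool tooFewTags = openCount < 5;
    // Assumed threshold: more than one in five '<' is badly nested.
    const bool badlyNested = badCount > openCount / 5;
    // Assumed threshold: a long input that shrinks to almost nothing.
    const bool mostlyMarkup = stripped.size() < 100 && input.size() > 600;

    if (tooFewTags || badlyNested || mostlyMarkup)
        return FilterResult{input, false}; // fall back to unstripped input
    return FilterResult{stripped, true};
}
```

Under these assumptions, stripMarkup("<html><body><p>Hello <b>world</b></p></body></html>") yields the text "Hello world", while an input with only two tags such as "<p>hello</p>" is returned unchanged.
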
QString MCharsetDetector::errorString (  )  const

returns a text describing the error which occurred during the last action

Example:

 MCharsetDetector detector;
 if(detector.hasError())
   qWarning() << detector.errorString();
See also:
hasError()
clearError()
QStringList MCharsetDetector::getAllDetectableCharsets (  ) 

returns a list of detectable charsets

This does not depend on the state of the MCharsetDetector object, it always returns the same list.

bool MCharsetDetector::hasError (  )  const

checks whether an error occurred during the last action

See also:
clearError()
errorString()
bool MCharsetDetector::isInputFilterEnabled (  ) 

test whether input filtering is enabled or not

See also:
enableInputFilter(const bool enable)
void MCharsetDetector::setDeclaredEncoding ( const QString &  encoding  ) 

set the declared encoding for charset detection.

Parameters:
encoding a charset name to give as a hint

This can be used as an additional hint for the charset detector.

This is passed through to the libicu encoding detection system, which currently does not do anything with that information. MCharsetDetector, however, uses it to improve the matches returned by the libicu encoding detection.

See also:
setDeclaredLocale(const QString &locale)
void MCharsetDetector::setDeclaredLocale ( const QString &  locale  ) 

set the declared locale for charset detection

Parameters:
locale a locale name to give as a hint

This can be used as an additional hint for the charset detector.

For example, setDeclaredLocale("zh") may help to detect Chinese legacy encodings better; setDeclaredLocale("zh_CN") may further improve this for simplified Chinese legacy encodings; and setDeclaredLocale("ru") may help to detect Russian legacy encodings better.

See also:
setDeclaredEncoding(const QString &encoding)
void MCharsetDetector::setText ( const QByteArray &  ba  ) 

sets the input byte data whose charset is to be detected.

QString MCharsetDetector::text ( const MCharsetMatch &  charsetMatch  ) 

returns the entire input text decoded into a QString, using the encoding of the given match


Copyright © 2010 Nokia Corporation
MeeGo Touch