Wednesday, December 05, 2012

Fixing a lamp with #3Dprinting

The next #copyright battle ground is likely to be all about 3D-printing. Consider the story of the IKEA lamp in Wired; the shade was broken and a designer created a dozen new shades. New shades were printed at a cost of $5,- and the availability of a 3D-printer for several hours.

When a design can be exactly replicated at will there is no need for the factories or shops of IKEA. Consider the impact on retail shops; there is no need for a lot of storage room; all that is needed is a powerful 3D-printer to provide customers what they want.

Obviously factory produced goods are cheap. The cost of a product is very much in things like transport, distribution, overhead and breakage. At this time 3D-printed goods are more expensive to produce. This will change and a new cottage industry will come to exist.

Designs owned by global players can be easily copied and changed and there is little to prevent this from happening. Obviously there will be a lot of copyright posturing and industries expecting royalties because 3D printers are there just to rip off the rights holders...

Ah well, it is the likely way of progress. I however like the notion of more local production.

Wednesday, November 14, 2012

A #font for people with #dyslexia III

Yesterday I met a teacher. I also met someone who was a counselor in Mali. She was really enthusiastic about the news that there is a freely licensed font that works especially well for people who suffer from dyslexia.

Today I received a mail where I was told that the availability of the Open-Dyslexic font will be the basis of a possible project in Mali.

A #font for people with #dyslexia II

The good news is: the Open-Dyslexic font will be part of the initial roll out of the "universal language selector". The font is already available for some 25 languages on Wikidata and where the "universal language selector" is already deployed. These languages are listed below.
  • en - English
  • af - Afrikaans
  • ca - Catalan
  • cy - Welsh
  • da - Danish
  • de - German
  • es - Spanish
  • et - Estonian
  • fi - Finnish
  • fo - Faroese
  • fr - French
  • ga - Irish
  • gd - Scottish Gaelic
  • hu - Hungarian
  • is - Icelandic
  • it - Italian
  • lb - Luxembourgish
  • mi - Maori
  • ms - Malay
  • oc - Occitan
  • pt - Portuguese
  • sq - Albanian
  • sv - Swedish
  • sw - Swahili
  • tr - Turkish
  • wa - Walloon
Many languages are missing in the list. For me the lack of nl - Dutch is obvious. But I do not have to break a sweat to come up with languages like sr-Latn id tl li gl fy sl. Languages like Indonesian or Tagalog are spoken by hundreds of millions of people. 

When you consider that languages with special characters like Icelandic and French are in this list it is obvious that many more languages are already supported by this font. When you consider that the WMF has good relations with the creator of the font it is more than likely that missing characters for your language can be added to the font.

So first check whether all the characters used in your language are supported by the Open-Dyslexic font. If they are, ask for the font to be enabled either by making the request on the support page of or by requesting it on bugzilla. You can also send a mail with this happy news to all the teachers you know. If they are not, identify the missing characters and ask nicely for them to be added.

Open Source is such a wonderful enabler. Consider; not only is this font available to all the teachers and the sufferers of dyslexia, it is also available for use on websites like Wikipedia. Given that our aim is to share in the sum of all knowledge this is an important step in the right direction.

A #font for people with #dyslexia

Today I was able to make some difference. I met a teacher who is specialised in children with "special needs". The conversation got onto the subject of language technology and I mentioned the existence of the Open-Dyslexic font.

She was so happy to learn about the existence of this font. She asked me to send details today so that she can install the font on her pc tomorrow. This week she is going to meet some thirty persons who will be happy to learn about the existence of this freely licensed font as well. A lot of children are going to benefit.

I cannot wait for the Wikimedia Foundation to make the Open-Dyslexic font available for use on its wikis. It will affect so many more people. It is likely to be mentioned in the international press; it is that relevant.

Tuesday, November 13, 2012

Tagging #British paintings in public ownership

Another great British initiative: adding tags to the entire national collection of oil paintings in public ownership in the United Kingdom. What makes this project relevant is the huge number of volunteers collaborating on this.

The Public Catalogue Foundation has something to say about copyright. Sadly they are wrong. They say:
Is there copyright in the photographic image as well as copyright in the actual painting?
Yes, there is. The photographer uses training and skill to photograph a painting and he or she holds copyright in that photographic image. However, as part of its agreement with collections, the PCF agrees that its photographers will hand over copyright in all their photographic images for collections’ own use.

There is a word for this point of view: copyfraud. When this point of view is typically British, it is just that.

The PCF may claim copyright on the tags added to the paintings. There is no mention about this.

Tuesday, November 06, 2012

#Wikivoyage pictures; the fun of being there

The coin dropped for me. Once Wikivoyage is a running Wikimedia project, it is about travelling, going on a holiday, seeing the sights. All these articles need their illustrations. I think that this is to be expected. It is obvious to me what kind of pictures they can be.

Many of them will be holiday pictures and there is nothing wrong with that. These pictures will have a use: they illustrate Wikivoyage.

What I like about it is that they broaden the horizon of Commons and hopefully inject Commons with many more fun images.

The Commons challenge will only become more relevant: what will it take to make it the wiki where everyone can easily find a great and appropriate media file?

Saturday, October 27, 2012

Depressed because of a #photo #competition

In many a #Wikipedia there is an article for almost every subject. All these articles need illustrations and Commons is where these illustrations are stored. Some subjects are hard to illustrate. Subjects like depression or schizophrenia for instance.

In my home town there was a competition to create something that expresses depression. There were prizes to be won and the jury included a professional photographer. There were many great entries to the competition. Having a professional photographer in the jury however was a mistake.

When you run a competition, there is a point to it. This contest had the intention to bring attention to the subject of depression. This means that the winning picture has to be published. Obviously it should be part of the conditions of entering the competition. It wasn't.

The professional photographer told the winner that the picture was of professional quality and that it is worth money. The result is that the picture is only available on the Intranet of the organisation running the competition.

The picture you see did not win. It does show the Almere skyline. The weather was a bit depressed as well. Thanks Tamar for sharing your picture that "also ran". For me this picture is much more the winner because it is available.

Thursday, October 04, 2012

Celebrating the Dead, Wiki Style!

In Mexico, the dead are celebrated once a year during an event called “Día de Muertos” or Day of the Dead. It is a syncretism of indigenous beliefs with Catholicism. While it is observed in almost all of Mexico (and now parts of the United States), how it is celebrated varies from one place to the next. For example, local observance can last from one to three days. The “day” is 2 November, but those who observe three days begin on 31 October. Preparations for the event begin well before this and usually include the creation of an altar called an “ofrenda” (offering) which includes traditional Mexican foods, fruits and other produce in season and if at the home, photos of the deceased being honored along with offerings of things in life that they enjoyed. At my home, my mother’s photo is accompanied by Milky Way candy, Pepsi, a cup of tea and even a pack of cigarettes. In addition to ofrendas, it is traditional to clean and decorate family graves on this day and even spend a day or night there. During the month of October, many schools and cultural institutions sponsor events as well.

The library at my school, ITESM-Campus Ciudad de México in Mexico City, wanted to participate in some way with Wiki Loves Libraries which occurs in October and November. We decided one of the best ways we could introduce working with Wikipedia to students was to sponsor a photo contest, similar to Wiki Loves Monuments, themed for this holiday. While there are a number of photographs already in Wikimedia Commons, they do not really begin to tell the story of this very rich tradition. The contest has three categories with prizes: 1) the best photo (which I call the “Wow” category) 2) the most original photograph (of something no one else thought to take a picture of) and 3) the student that uploads the most photographs of different things to Commons during the contest period, which is 5 October to 5 November 2012. The three categories are there to encourage different kinds of photography, not only good pictures with good cameras and techniques, but photos of local traditions, preparations and more as well as increasing the breadth of photographic coverage. It not only allows students who don’t have access to expensive cameras and training a chance to win, it also aims to capture images and themes which are not already in Commons.

Sunday, September 30, 2012

Open #Wikipedia for people who are #Dyslexic II

Great ideas are not uncommon. I am really happy to learn that Reedy had the idea to support the Open-Dyslexic font in MediaWiki before me. As you can see in the screen shot, progress has been made.

It is now just a question of dotting the i and crossing the t and pushing it out.

Saturday, September 29, 2012

Open #Wikipedia for people who are #Dyslexic

We have the technology to make an important improvement for the usability of Wikipedia. The technology exists in the ability to provide web fonts from within MediaWiki. The opportunity is to improve the readability for people who are dyslexic.

What it takes is to enable web fonts functionality on for instance the English Wikipedia and allow the use of the Open-Dyslexic font. The license is a bit weird for a font but it is within what the WMF is comfortable with; the CC-by license.

It should not take long to appreciate why this is a good idea. It should not take much after that to help people with dyslexia.

Monday, September 24, 2012

About the #quality of #Wikipedia

Sometimes a news item is too good to be true... and then again...

I have no comment about the relative quality of Wikipedia compared to Congress's $100M research service. What I find of interest is the implied value that is put on the quality of Wikipedia. Eh, I assume the value is $100M a year.

Thursday, September 13, 2012

Use Europeana's TimeMash for #WikiLovesMonuments

#Europeana has a rich repository of images of many, many monuments. They show what monuments looked like once upon a time.

Wiki Loves Monuments is a photo competition where people take pictures of monuments as they look in this day and age.

The TimeMash application adds a dimension to the pictures that can be taken; it helps photographers make pictures from the same angle. The combination is powerful; pictures taken in this way are more interesting.

It is possible to have both images on Commons and even used in the projects. The challenge is to register and link such pictures and to have a tool to switch between such images.

Tuesday, August 28, 2012

The need for both #OpenID and #OAuth

Many words have been used on the merits of OpenID and OAuth. There are many misconceptions and many of those have everything to do with perspective. In order to get a better understanding I asked on the Wikitech mailing list for a use case for OAuth. The answer I received helps.
OpenID is an identity management system. It allows users to authenticate to one site using another site as their identity. A use case for this is, for example, using your Facebook account to log in to Wikipedia. This may be useful, as it would allow users to more easily register for Wikipedia.
OAuth is a third-party authentication and authorization system that allows outside applications to do stuff on behalf of a user. The reason for this is because currently toolserver applications, etc. authenticate to Wikipedia using a plaintext username and password, which is extremely insecure for a number of reasons I will not elaborate on here.
When you read the answer, there are some observations to make. The most obvious is: how do you ensure that the software that is to use OAuth will be secure? Given the power of many Toolserver tools, how do you make sure that only trusted people make use of the Toolserver functionality?

Enter OpenID, it does provide identity management. OpenID is able to provide more information than just "this is indeed the indicated identity" as part of the "OpenID Attribute Exchange". When the Wikimedia Foundation implements OpenID as a service, it will be possible to identify the users that have a "bot flag" on the user profile. 

As it is, the Toolserver tools are not necessarily secure. With OAuth it will become even less secure to run the software because it will be the software itself that includes the authorisation to run, never mind its configuration, never mind how it is used or by whom. When OpenID authenticates users, it becomes possible to ensure that only people with a bot flag can run Toolserver software on the production Wikimedia projects.

To make the use of the Toolserver tools secure, it is necessary to complement OAuth with OpenID. OAuth in isolation will make the Toolserver tools easier to use but it does not make them more secure to run.
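The argument can be illustrated with a toy model in Python. It is only a sketch and does not implement the OpenID or OAuth protocols: the names `Identity`, `Grant` and `may_run_bot` are hypothetical, and the "bot" attribute merely stands in for a bot flag that an identity provider might expose through attribute exchange.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    """What an identity provider asserts about a user (the OpenID side)."""
    username: str
    attributes: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Grant:
    """What a user allows a tool to do on their behalf (the OAuth side)."""
    username: str
    tool: str
    actions: frozenset


def may_run_bot(identity: Identity, grant: Grant, action: str) -> bool:
    """Allow a tool action only when delegation and identity both check out."""
    same_user = identity.username == grant.username
    delegated = action in grant.actions      # OAuth-style: the tool was authorised
    trusted = "bot" in identity.attributes   # OpenID-style: the user has a bot flag
    return same_user and delegated and trusted


# Hypothetical users and tool, for illustration only.
alice = Identity("Alice", frozenset({"bot"}))
bob = Identity("Bob")
alice_grant = Grant("Alice", "ExampleTool", frozenset({"edit"}))
bob_grant = Grant("Bob", "ExampleTool", frozenset({"edit"}))

print(may_run_bot(alice, alice_grant, "edit"))  # delegated and flagged
print(may_run_bot(bob, bob_grant, "edit"))      # delegated but not flagged
```

In this model the grant alone answers "may this tool act for this user", while the identity attributes answer "is this user trusted to run it"; only the combination addresses both questions.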

Thursday, August 23, 2012

#Wikimedia Commons, a relevant history

A long long time ago when Commons the wiki was created, it was only a dream. The dream was to have one place where all the images could be shared freely among all the Wikimedia Foundation projects.

It was a dream with really practical implications. At that time sharing a picture meant uploading the same picture to another wiki. At that time removing a picture because of a copyright violation meant removing it on one wiki and perhaps on one or more other wikis as well. This was an inexact process with inadequate results. The dream of Commons was to store a file once and use it on any wiki. The dream of Commons was that administrative procedures would be significantly enhanced because everything could be done once and everything could be done right.

Commons was created and it took several months before pictures on Commons were available on the WMF projects.

Today Commons fulfils much of this promise. Media files are shared on all the WMF projects; you can even enable this functionality on external MediaWiki wikis. Most of the duplicate files have been deleted and only a single copy exists on Commons. Nowadays the administrative procedures are much more effective. Everybody involved in Commons will agree it is not perfect, far from it, but Commons is thriving nonetheless.

This notion of adventure that existed in the Wikimedia Foundation seems dead. I am convinced it should be revived. There are some signs that it is possible ... in things like moving towards agile development ...

More later

Tuesday, August 21, 2012

a #threat assessment for #Wikipedia

The people who take care of mail sent to Wikipedia are often informed that Wikipedia is not secure; "everybody can edit Wikipedia". This is actually intentional because Wikipedia is the encyclopaedia that everybody can edit. The real risk is when people do not recognise they are invited to edit. This is a genuine issue and it is something that receives a lot of attention.

When you consider security for Wikipedia, the people most at risk are its editors. There are several threats they are exposed to. Several of these are issues computer security can deal with.
  • threat to the anonymity of a registered user
  • threat to user credentials
When the potential threats are evaluated, it is important to realise that the severity of these threats is not obvious. It matters considerably where you reside, what your ethnicity is or what your belief system is. It is important to minimise any threats because once people no longer feel free to contribute it will damage the "neutral point of view" that gives Wikipedia much of its relevance.

With the implementation of HTTPS it has become considerably more difficult to learn what a person is doing when working on Wikipedia. This has been a real improvement. However, user credentials and particularly passwords are considered not really secure. Read for instance what Wired had to say about them. It is explained that improvements can only be expected when changing the infrastructure of online security. This will probably do a whole lot more good than lecturing people about how they should change their behaviour.

The question is if the WMF is open for such considerations. So far the talk is about "Nascar" ?!?! to me this sounds remarkably like bikeshedding and is very much beside the point.

Monday, August 20, 2012

#OpenID for the USERS of #Wikipedia, PLEASE

Again a discussion about the use of OpenID for the Wikimedia projects flared up. From my perspective the one perspective missing is the one of a computer user who is fed up with the failed security that is provided by passwords.

The problem is that system managers only consider security in isolation. It is the solution that is to be adopted for their system or systems. Obviously in a perfect world, a user will have a separate password for each website or program. The world is not perfect and most people use one or a few passwords for everything. The world is not perfect and passwords of many big websites have been uncovered by hackers. Consequently many passwords used by Wikimedia contributors can easily be guessed by the bad guys who are in the know.

The problem with passwords for a user is that they are unmanageable. Too many systems and websites, too many interfaces seriously impact the security wherever passwords are implemented to provide security. It is theatre and the fool is the part you have to play.

OpenID provides a serious alternative. It allows for a single place with a single password that authenticates to any and all websites and services that accept security in this way. It is a serious alternative as long as any and all accept other OpenID providers. It will be really welcome when the WMF considers security for its 456 M users. It is obvious that a large percentage also frequent websites like LinkedIn, which solidifies the argument to implement OpenID.

Sunday, August 19, 2012

#Font support by #Google and #Wikimedia

The quality of the Amiri font has been recognised by the Wikimedia Foundation for some time; it has been provided in the Web Fonts extension for some time and recently in another bit of "read the whole article to get to the good news" it became part of a Wikimedia supported jQuery library for web fonts.

Google has also recognised the Amiri font; they make Amiri available through their "early access" program. Google supports multiple (freely licensed) Arabic fonts as web fonts. The biggest difference for me between the two programs is that the WMF library has your server provide the fonts while the Google offering has the fonts provided through the Google infrastructure.

Both Google and WMF support fonts for many scripts. The question is what they will do when the font has technical requirements that are more than usual.

Recently the Tuladha Jejeg font for the Javanese script was given a free license and the Wikimedia Foundation will assess if they will include it in their Web Fonts extension. There may be a technical problem; it makes use of the SIL Graphite technology. The question is to what extent browsers support this technology.

When Google is serious in supporting the rare scripts, the opportunity to support the Javanese script through the Tuladha Jejeg font may be reason enough to put some extra effort in enabling support for SIL Graphite.

<grin> One may hope for Google to do more good, we know they can</grin> <seriously> For the WMF there is hardly another option when they are to support Javanese </seriously>

#Internationalisation is more than conversion of numbers

A friend read my last blogpost and pointed me to a recent presentation about the current best of breed internationalisation for JavaScript. He had been testing the new jQuery internationalisation library published by the Wikimedia Foundation and was astonished that the one thing "everybody" does was missing.

His question was: "where is the conversion of numbers and dates?". Obviously MediaWiki "does" the conversion of numbers and dates, it is just not part of the library that enables what is most crucial for us in Internationalisation; the translation of the messages in more than 280 languages.
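To give an idea of what "conversion of numbers" means, here is a minimal sketch in Python. It is not the API of the jQuery library or of MediaWiki; the function name `localise_digits` is hypothetical, and the sketch only swaps ASCII digits for Arabic-Indic digits. Real number and date formatting also has to deal with separators, month names and ordering, which this does not attempt.

```python
# Arabic-Indic digits occupy U+0660..U+0669, in the same order as 0..9.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
TO_ARABIC_INDIC = str.maketrans("0123456789", ARABIC_INDIC_DIGITS)


def localise_digits(text: str) -> str:
    """Replace every ASCII digit in a message with its Arabic-Indic digit."""
    return text.translate(TO_ARABIC_INDIC)


message = "280 languages"
print(localise_digits(message))

# Python itself understands the converted digits as the same number.
print(int(localise_digits("280")) == 280)
```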

My friend Andrew was thinking of extending the library with the conversion of numbers and dates. It makes sense to have them included; however, forking this really new library at this early time is "evilish". Talking to Santhosh, the developer of this library, is the thing to do. It is, because in this way any future improvements in the existing 280+ languages or the missing 6000+ languages will be shared by anyone who updates to this library.

YES, I know jQuery is a JavaScript library and not JavaScript itself. But I also know that my friends at the Wikimedia Foundation support the localisation of their JavaScript.

Friday, August 17, 2012

#jQuery and #Internationalisation of YOUR application

There has been a lot of good news about jQuery this week.
The news of this jquery.i18n library is wonderful news. As it is the same software as used for the internationalisation of MediaWiki, it represents the current knowledge of the 280+ languages that have their Wikipedia.

When your software is open source and when you adopt this library to implement internationalisation, you are that much closer to make use of that wonderful community at They are already doing a great job for many applications, why not yours?

Wednesday, August 01, 2012

Learning to type #Arabic; #Wikisource exercises

The one skill that makes it easy to use a computer is typing with ten fingers. Particularly when you learn a new script, you need to learn to type again. Learning this skill well means exercises.

I remember that when I learned to type on a typewriter I hated the futility of these exercises. Now that I need the skills to learn Arabic, I can think of exercises that are of use to everyone.

The WIKISOURCE for the Arabic language.

All Wikisources have the ProofreadPage extension installed. So all that is needed to tap into the vast pool of people who are learning to type Arabic is having pages ready using this tool.

My skill in Arabic is such that I can not easily find these. When I have, the Arabic Wikisource has its first volunteer that will exercise and do good at the same time.

Tuesday, July 31, 2012

Learning a new #script .. #Arabic

As I am learning a new language with a script that is new to me, I find the Internet not yet the resource it is in languages for the Latin script.

There are several obstacles; I have to configure my computing devices for the Arabic script and keyboard. I have to find the characters on my keyboard and only when I do will I get the facility that I need.

The one thing that would make my day is to have a typing tutor that is ready for an international public. This means that the instructions are available in multiple languages for the same input method for the same language. When these parts are separated from the code, there are three specific parts.
  • the user interface
  • the exercises
  • the input method or keyboard layout itself
This allows the same software to be used for multiple input methods and scripts / languages. The software can be localised at and the typing tutor can be distributed with the input methods themselves. 
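A minimal sketch in Python of how these three parts could be kept apart. All class and function names are hypothetical, the two-key layout is illustrative only, and this is not the design of any existing typing tutor.

```python
from dataclasses import dataclass


@dataclass
class KeyboardLayout:
    """The input method: which key produces which character."""
    name: str
    keys: dict  # physical key label -> character produced


@dataclass
class Exercise:
    """Text to type; it depends on the layout, not on the interface language."""
    layout_name: str
    text: str


@dataclass
class Interface:
    """Localisable messages, keyed by interface language code."""
    messages: dict  # language code -> {message key -> translation}

    def message(self, lang: str, key: str) -> str:
        # Fall back to English when a translation is missing.
        return self.messages.get(lang, {}).get(key) or self.messages["en"][key]


def missing_characters(exercise: Exercise, layout: KeyboardLayout) -> set:
    """Characters in the exercise that the layout cannot produce."""
    producible = set(layout.keys.values()) | {" "}
    return set(exercise.text) - producible


# Illustrative data: two keys of an Arabic layout, one exercise, two UI languages.
layout = KeyboardLayout("arabic-example", {"f": "\u0628", "h": "\u0627"})
exercise = Exercise("arabic-example", "\u0628\u0627\u0628 \u0628\u0627")
ui = Interface({
    "en": {"start": "Type the text below"},
    "nl": {"start": "Typ de onderstaande tekst"},
})

print(ui.message("nl", "start"))
print(ui.message("fr", "start"))              # no French yet, so English is used
print(missing_characters(exercise, layout))   # empty: every character can be typed
```

Because the exercise refers to the layout only by name, the same exercise can be shown with instructions in any interface language, and a new interface translation needs no change to the exercises or the layout.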

Wednesday, July 25, 2012

Multilingual cluefull bots

The #BBC writes that #Wikipedia without its bots is doomed. It is a good story, a happy story and well worth a read. Wikipedia as you know is not only the English language Wikipedia, there are over 280 Wikipedias in different languages. They are not all created equal; most of the smaller Wikipedias have fewer than 100.000 articles, a small community of editors and their own issues with people who think it funny to write penis wherever they can.

The BBC article writes about "Cluebot NG", which is said to reside on a computer somewhere. It would be a great project to move Cluebot NG to a Wikimedia server and make instances for all the other languages. First for the 40 something languages with more than 100.000 articles and when its lessons are learned there may be scope to expand.

There is a committee supervising the bot activities, it will be wonderful when it has the potential to expand its expertise as widely as possible.

Tuesday, July 24, 2012

Levels of competence for #Arabic in #Wikipedia III

Tools like #Google #Chrome use an engine to do the rendering for them. For Chrome it is Webkit. It is responsible for connecting two items that are in separate HTML elements and making sure that the rules for the Arabic script are addressed.

What started as an observation in MediaWiki became a Chrome bug and is now a Webkit bug.

Webkit is used in multiple browsers. This makes a fix relevant for Apple's Safari browser as well. The one question left is what it will take to bring a fix into production within those browsers in the near future.

Sunday, July 22, 2012

Levels of competence for #Arabic in #Wikipedia II

It does matter what browser you use. Google's Chrome browser is the reason why the Arabic Babel templates are displayed wrongly.

If you use Chrome to browse Arabic websites, you want to add your vote to fix this bug. You vote by marking the star that is at the bottom of the page.

I am really pleased with how quickly this issue was analysed. It is now for Google to support its users who use languages in the Arabic script.

Levels of competence for #Arabic in #Wikipedia

As my competence in the Arabic language is growing, I added Arabic in my Babel information. To my amazement what I read is wrong. The "ba" is written on its own while it should be connected with the following "aliph". It is one of those peculiarities of the Arabic script that a character like the "ba" is expressed depending on its position in a word.
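A small Python sketch of what "depending on its position" means. Text stores the "ba" as one code point, U+0628, and the rendering engine normally picks the shape. Unicode also contains legacy "presentation form" code points, one per shape, and the standard `unicodedata` module shows that they all map back to the same letter. This only illustrates the script behaviour; it says nothing about how MediaWiki or a browser implements it.

```python
import unicodedata

BA = "\u0628"  # ARABIC LETTER BEH, the letter as it is stored in text

# Legacy presentation forms, one code point per contextual shape of the ba.
FORMS = ["\uFE8F", "\uFE91", "\uFE92", "\uFE90"]

for form in FORMS:
    # Compatibility normalisation folds a presentation form back to the base letter.
    base = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> same letter: {base == BA}")
```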

I had a look at to understand why it is wrong; as you can see the "ba" or ب is connected to a wiki link and the word is not formatted after what is in the variable.

It is fun to learn other languages. What is really amazing is that I can find glaring issues like this one.

Thursday, July 19, 2012

European cranes want to be free


This wonderful video of European crane courtship is currently copyrighted and not freely licensed. The video comes with a button that allows you to embed it in a website and its article has a button where you can order the video.

The video is on the website of Stichting Natuurbeelden and they made a nice offer; 50 of their videos will become available under the CC-by-sa license. There are many great videos to choose from and for all kinds of reasons, the initial offering is to choose at most 50 of them. 

Organising this selection will not be easy and the relevance of some videos is very much in the context of nature in the Netherlands. European cranes are breeding again in the last few years. When the fifty videos are selected and when they are used in our projects, these videos will be seen quite often. At the moment when I write this, the number of views for this video is only 136.

Saturday, July 14, 2012

Can everybody read #Wikipedia?

A lot of effort goes into making "Wikipedia the encyclopaedia everybody can edit". The result is wonderful; there are Wikipedias in over 280 languages, a big effort is under way to make editing even easier and as so many people do edit, it became a rich almost authoritative resource. When Wikipedia goes off-line, students despair.

People do read Wikipedia, it is very popular but the notion that Wikipedia is hard to read is not really considered. Take for instance today's featured article. There are too many words on a line. For many people this makes it hard to read; some give up on an article or on Wikipedia.

Yes, you can change the way content is displayed on a computer screen. The problem is that people who have problems read websites in the default format.

The Wikimedia Foundation does have the expertise to consider these issues. The people who do have already too much on their plate to support the current software development. However, the proof of the pudding that is Wikipedia is in people READING its content and that makes this a key concern.

Tuesday, July 10, 2012

Something positive to say about #Apple

Apple is innovative, it routinely adds new functionality and quality to its products. Many people love Apple products and pay a healthy premium for the latest and greatest.

One of the recent innovations goes by the name of "retina display". It is a high resolution screen of such quality that the human eye no longer sees individual pixels. Innovations like the retina display add demand for high quality images and stimulate the use of high quality images and of SVG or scalable vector graphics in WMF projects like Commons and Wikipedia.

The improved technology used in Apple hardware stimulates quality improvements at content providers. In this way Apple stimulates a healthy and innovating content ecosystem. As other hardware suppliers are continuously catching up with the leader of the pack, there is real value in buying Apple.

I did it, nothing negative to read in this post about Apple.

#GLAM - About recognition

 Left Hand Bear, Oglala chief
This year's #Commons picture of the year contest was different from last year's. The many old images that were so lovingly restored and featured on the Commons main page were not there any more.

A thread on the mailing list reminded me about all the hard work that gives images of the past a new lease of life. The image of Left Hand Bear, the Oglala chief is used a lot. As you can see below it is even used to make ties, mugs and buttons.

The image of Left Hand Bear has been lovingly restored by Adam Cuerden. The original of this image is at the Library of Congress and I owe a debt of gratitude to both Adam and the LoC.

Adam restored an image preserved at the Library of Congress. Knowing this, I am sure that this is indeed an image of Left Hand Bear. The image is obviously in the public domain and as such I am not required to acknowledge either the LoC or Adam. I may put the image on mugs, ties and buttons and sell them.

For both Commons and Wikipedia, acknowledging the LoC and Adam brings important benefits. Acknowledging the LoC provides provenance of the image; this is the equivalent of providing a source to a fact. Acknowledging Adam links the much improved image to the original. It recognises Adam for his work.

Acknowledging the LoC and Adam IS a best practice. It is a best practice promoted by organisations like Europeana. It is a best practice that is not a requirement, it is however something that we should aspire to.

Monday, July 02, 2012

#Kiwix - the interview

#Kiwix is the tool that allows you to read the content of a Wiki offline. It has been developed with Wikipedia in mind but is equally usable for Wikisource or Wikibooks. I am really happy to have interviewed Emmanuel who knows all the ins and outs of this wonderful piece of software.

What is Kiwix and what is it used for
Kiwix is software that enables people to read Web content without an internet connection. It's a reader which works with ZIM files containing all the content. It's used to access Wikipedia offline, by reading pre-packaged Wikipedia ZIM files. It's mainly used by people who want to have an encyclopedia, but are too poor to have access to the internet. It's also used, for example, by travelling people (plane, ship, train), prisoners and students at school.

Can you tell us something about its popularity
We have users all over the world and the audience is increasing quickly: we had 25,000 downloads of Kiwix in May.

In how many languages is Kiwix supported 
Thanks to the Translatewiki Web site and its community, Kiwix is localised in more than 80 languages. We also provide content (ZIM files), mainly Wikipedia, in around 25 languages. But we want to do more: thanks to a grant from WMCH we will soon offer ZIM files of Wikipedia in all languages.

How do you support languages written in scripts like Malayalam, or Tibetan
Contents are Web contents and Kiwix itself is a sort of browser getting the Web pages from the ZIM file instead of the Web. So, we do not have special handling in Kiwix itself to render the contents. Everything should be well organised in the ZIM file, for example by using Web fonts. But, from the Kiwix fulltext search engine point of view, this is challenging. Natural languages have a lot of particularities. Kiwix uses the Xapian search engine and tries to integrate CLucene. We do our best with them to offer the smoothest user experience possible. 

Do you provide fonts with Kiwix for the languages that use these scripts
The Wikipedia ZIM files we are preparing still do not provide the Web fonts. Web fonts have been part of the Wikimedia projects for a few months already, so we have to fix that ASAP; this is not a big challenge.

For some languages like Chinese and Serbian, we show the content in two scripts ... Can Kiwix do this as well ?
Kiwix does not provide any transliteration tool for now, but all the technology is already there in the software. We use a powerful Unicode library called ICU which can do that. We want to use it to allow users to do custom transliterations. C++ developer wanted there!

Kiwix uses the OpenZIM format ... can you tell us more about this format
The format is called ZIM. There is a volunteer driven project called openZIM, created a few years ago to specify the format and develop a standard library. The ZIM format makes it possible to put millions of articles together, to compress a part of them, and to add metadata. In the end, you get only one file, which is at the same time extremely compressed and allows constant and quick random access.

Nowadays, many publications are in the EPUB standard ... can Kiwix handle this as well
Kiwix is not able to deal with EPUB, but in the future it will. We think the EPUB and ZIM formats are complementary and we want Kiwix to be able to deal perfectly with EPUB. Our plan is to integrate "Monocle" to do that. There too, developers are wanted.

How do people find content available for Kiwix
Kiwix has its own content manager so you can download content from Kiwix itself. But you may also download the ZIM file from the Kiwix Web site or by using the Mediawiki Collection extension.

In the future, we want to have a platform (something like iTunes) that makes it really easy to find and download content (both ZIM and EPUB files). We have started a project with that in mind. We need your support!

What is your biggest challenge at this moment in time with Kiwix
Building Kiwix-mobile for Android. We will have a first release in autumn. But we have many other projects running at the same time and others for which we still need volunteers.

Thank you

Saturday, June 30, 2012

#ImpactOCR - Digital publications and the national libraries

According to the #ISBN standard, the "format/means of delivery are irrelevant in deciding whether a product requires an ISBN". However, it is often assumed that a publication requiring an ISBN number is a commercial publication. In the USA and the UK for instance you have to buy your ISBN number or bar code while in Canada they are free because Canada stimulates Canadian culture.

When a standard is not universally applied, it loses application. When not all publications are registered, a national library has to maintain its own system if it is to collect a copy of every publication. As a result the ISBN is dysfunctional as a standard because it does not function as a standard.

When the Wikisourcerers finish the transcription of a book, it deserves an ISBN number, and national libraries should be aware of these publications. This recognises and registers the work done in the Open Content world. When these books are registered, all the Open Content projects may know that they can concentrate on another book or source.

As the ISBN does not register all publications, it does not do what it is expected to do; function as a standard. 

Friday, June 29, 2012

Learning another language II

Learning to vocalise #Arabic can be done on the Internet. There are websites like Mount Hira where you find Arabic to the right and a transliteration to the left, with an explanation in English underneath. What really helps is that you can listen to the recitation of a surah line by line.

It is great but there is room for improvement. Improvement that can happen at Mount Hira but also at a Wikisource.
  • Show all the text as text and not as graphics
  • Allow for the Arabic text to be shown in fonts representing different writing styles
  • Allow for the explanations to be shown in a language that can be selected
Learning to vocalise what you read is one use case, learning Arabic is another and learning the Koran is a third. Wikisource has the potential to be a place for all three objectives.

There are always people who are interested in reading the source documents about a religion, any religion. The great thing of Wikisource is that you can include the wikilinks explaining the terms that are ambiguous or obscure.

Typically these source documents are readily available and out of copyright. Including them in Wikisource will gain it a larger public. Wikisource is an obvious place because it has a reputation to keep up; the reputation that original documents are original.

The Hebrew #Wikipedia is bigger than all of #Wikisource

When you compare the statistics for Wikipedia and Wikisource, the Wikipedia in Hebrew gets more eyeballs than all Wikisources combined. When you consider the potential public for both projects, the public for Wikisource is bigger by orders of magnitude.

On the main page of the English Wikisource it says: "We now have 783,122 texts in the English language library". It is less clear how many of these texts are in a final form, meaning ready for casual readers. The pages linked to the proofread tool indicate 691 completed pages and 3588 pages ready for proofreading. This does not imply that there are 691 texts that are ready.

For a reader, the sources at Wikisource are works in progress. A text featured on the Wikisource main page, like the Celtic Fairy tales, is very much a Wiki page. The presentation of the text and illustrations is very much accidental. It is a bit sad, but the pages were easier to read when they still needed to be proofread.

As you can see, the long lines below are not easy to read and the font used in the original book made the text more legible.

The Hebrew Wikipedia is easy to read. We can and should do better for Wikisource.

Thursday, June 28, 2012

#ImpactOCR - A #font using #unicode private use characters

When really old historic texts are digitised and OCR-ed, the images of letters found are mapped to the correct characters. Characters are defined in Unicode and when a character is NOT defined, it is possible to define them in the "private use" space.

As part of the Impact project, really old texts have been digitised, texts in many languages. At the recent presentation it was mentioned by two speakers that there were characters used in the Slovenian and Polish language that are not (yet) defined in Unicode. As part of their project, the missing characters were defined in the Unicode private use area and the scanning software was taught to use them.

With the research completed, and with the need for all these characters and their shapes defined, it will be great when these characters find their way into Unicode proper. When the code points for the missing characters are defined and agreed, the OCR software can learn to recognise the characters at the new code points, a conversion program can be written for the existing texts and it will be more inviting to include these characters in fonts.
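The conversion program mentioned above could look roughly like the sketch below. It is only an illustration: the function names and the mapping table are hypothetical, and the target code point is a placeholder, since the real code points would only exist once Unicode assigns them.

```python
import unicodedata

# Hypothetical mapping from a private use code point to an agreed code point.
# The target below is a placeholder, not an actual Impact project assignment.
HYPOTHETICAL_TABLE = {"\uE000": "\u0105"}


def private_use_chars(text):
    """Return the set of private use characters (category 'Co') in text."""
    return {ch for ch in text if unicodedata.category(ch) == "Co"}


def remap(text, table):
    """Replace private use characters using table; refuse to guess unmapped ones."""
    unmapped = private_use_chars(text) - set(table)
    if unmapped:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in sorted(unmapped))
        raise ValueError(f"no mapping for private use characters: {codes}")
    return text.translate(str.maketrans(table))
```

Raising an error for unmapped characters, rather than silently leaving them in place, keeps a converted text from still depending on a private agreement between the OCR software and a font.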

Now that the project is at its end, it is the right moment to extend the Latin script in Unicode even further.

#ImpactOCR - Citing a #newspaper

#Wikipedia has this rule: "Citation needed". Much of the news is first published in newspapers. When a citation is needed about something that happened, it is in the newspaper where you will find it mentioned and there may be many chronological entries on the same subject describing how something evolves.

A lot of research and development has gone into the optical character recognition of newspapers in the Impact project. As this project has ended and has evolved into a competence centre, its last conference was very much a presentation of what the project achieved.

From my perspective, it produced a lot of software, much of it open sourced, and all of it is implemented and embedded in the library, archive and research world. The work done in the Impact project finds its public very much in the research world. The general public can benefit as much; what has to be made clear is how it could benefit.

Through Europeana Newspapers some 10 million newspaper pages will be made accessible. These pages are scanned and to make them really useful they undergo optical character recognition. This is exactly where the Impact project has its impact; as the OCR technology improves, more words are correctly recognised and consequently more content of the newspapers can be discovered.

The results can be improved even further when the public helps train the OCR software to recognise characters for specific documents. As citing sources for Wikipedia is an obvious use case for historic newspapers, there are many people who are willing to teach OCR engines to do a better job. For those articles that are found to be particularly useful, proofreading can improve the results even further.

With a public that is involved in improving the digitised and OCR-ed texts everybody will be a winner including the scientific research on these texts.

Monday, June 25, 2012

Learning another language

Learning to read #Arabic, or more correctly Classical Arabic or "al fusha", is fun. I enjoy it for the challenge it provides; it helps because the many transliteration schemes suck and often confuse more than help.

When I started on this road to learn Classical Arabic, I was told that there is only one Arabic and I am also told that when you speak Classical Arabic in Arabic speaking countries, you are thought to be weird and often you find yourself not understood. I do not point out as often that the second assertion negates the first and consequently there are many different Arabic languages. I do not yet understand to what extent the Koranic Arabic language is different from what is labelled the "standard" Arabic language.

At this time I am learning to pronounce fully annotated texts. When you are used to Latin script, one of the lessons is that a space is not necessarily the dividing line between words. This is hard because it clashes with how you have learned to perceive a word at a glance. Al Arabiya or العربية shows two spaces but for me it is one word. To complicate it further, the vowels are missing and you need to know the Arabic grammar and vocabulary to know how to pronounce it with certainty.

Add to this the different ways Arabic is written and printed and you will appreciate learning the Arabic script for the intellectual challenges it presents.

Monday, June 18, 2012

#Batak - getting a #script ready for the #Internet

Several languages from Sumatra, #Indonesia were originally written in the Batak script. This script was encoded in Unicode and the waiting was for a freely licensed font. Thanks to a grant by the Wikimedia Foundation, a project is under way that will produce a font for Batak and will transcribe sources from Dutch museums like the Tropenmuseum.

For a script to be encoded into Unicode, a lot of research goes into describing the script and how it functions. For Batak you can find this documentation here.

When you read this, you find details like how the combination of a consonant and a vowel is expressed in the Batak script. When smart algorithms are used only the valid combinations will be expressed.
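To give an idea of what "only valid combinations" can mean in practice, here is a deliberately simplified sketch. It is not the logic of the Batak font, and the real rules are in the Unicode documentation and differ per language. It only checks the generic rule that a combining mark, such as a vowel sign, must follow a letter or another mark. The function name is made up for this example.

```python
import unicodedata


def marks_have_base(text):
    """Return True if every combining mark directly follows a letter or a mark."""
    previous = ""
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            # A mark at the start, or after a space or punctuation, has no base.
            if not previous or unicodedata.category(previous)[0] not in ("L", "M"):
                return False
        previous = ch
    return True


# A letter followed by a vowel sign, both taken from the Batak block.
print(marks_have_base("\u1BC2\u1BEA"))
# The same vowel sign on its own, without a consonant to attach to.
print(marks_have_base("\u1BEA"))
```

A font or input method for Batak would need far more specific rules than this, but the principle is the same: decide from the preceding character whether the next one is allowed.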

Given that multiple languages use the Batak script, the rules that are implemented in a font need to allow for every Batak language.

The first iterations of the Unicode font for Batak are being tested. When this is done, the font will be available and it will be possible to transcribe original Batak sources including a book on sorcery.

Saturday, June 16, 2012

Wikis waiting to be renamed

Many a #Wikipedia was created before the language policy was created. One of the requirements of the language policy is that any request for a new Wikipedia is in a language that is recognised in the ISO-639-3 standard.

At that time several Wikipedias were created in what was not yet recognised as a language. In the meantime several of these have been recognised as a language and as a consequence have their own code.

bat-smg -> sgs (wikipedia)
fiu-vro -> vro (wikipedia)
zh-classical -> lzh (wikipedia)
zh-min-nan -> nan (wikipedia, wiktionary, wikibooks, wikiquote, wikisource)
zh-yue -> yue (wikipedia)

With the deployment of bug 34866, an important improvement has been realised; the content of the Wikis involved does now indicate correctly in what language it is written. This helps because there is now more content correctly available on the Internet.
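The renames listed above amount to a small lookup table. The sketch below shows the idea; it is an illustration only and not how MediaWiki implements it, and the function name is made up for this example.

```python
# Legacy wiki codes and the language codes they were later given,
# taken from the list above.
RENAMES = {
    "bat-smg": "sgs",
    "fiu-vro": "vro",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def content_language(wiki_code):
    """Return the language code to announce for a wiki's content."""
    # Wikis whose code is already a proper language code are left unchanged.
    return RENAMES.get(wiki_code, wiki_code)


for code in ("zh-yue", "nl"):
    print(code, "->", content_language(code))
```

The wiki keeps its old name in the URL for now; only the language announced for the content changes.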

It is a relevant step in the direction of giving many wikis the name we would like them to have.

Thursday, June 14, 2012

#Wikidata can get more facts out

The big discussions that are raging on mailing lists about the infoboxes that can be populated with data from Wikidata are amusing. Amusing because it has been made plain that this is not something that is planned for in the near future. Amusing because there is more to it.

Consider the following scenario; two Wikipedias do not agree on what source to use for a specific bit of information. The result is different data in the infobox. What is passed as a solution is to override things locally. Nice thought but it is the wrong approach; it is reasonable to assume that multiple Wikipedias opt for either option. Wikidata as a data repository is agnostic to such issues. It can happily store information from multiple sources and have multiple info boxes for the same category of subjects.
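What "agnostic" could look like as data is sketched below. This is not Wikidata's actual data model; the class, the function and the example values are all hypothetical. The point is only that a value is stored together with its source, so that two Wikipedias can each pick the source they trust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    prop: str
    value: str
    source: str


def values_by_source(claims, subject, prop):
    """Return {source: value} for one property of one subject."""
    return {
        c.source: c.value
        for c in claims
        if c.subject == subject and c.prop == prop
    }


# Made-up data: two sources disagree about the same fact.
claims = [
    Claim("Examplestan", "population", "1000000", "Source A"),
    Claim("Examplestan", "population", "1200000", "Source B"),
]
print(values_by_source(claims, "Examplestan", "population"))
```

An infobox would then select by source instead of overriding the number locally.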

You may be able to override and maintain data on a local level. Doing so does not make it a best practice. Far from it, Wikipedia has this neutral point of view and Wikidata does change the rules. It has always been recognised that the NPOV is set by a language community and its prevailing wisdom is not necessarily neutral. The info boxes will not only bring consistency to the data, it will necessarily bring home the notion that facts have sources and many sources have a point of view.


Fairly often when you complain about issues at Wikimedia, there is somebody who notices, who cares. This is the sourcery of the crowds, the magic of our communities. Compare the picture above with the picture below.

What a difference a day makes ..

Wednesday, June 13, 2012

The eye on the prize

Noodlot, in translation "Fate", is a book by Louis Couperus. Couperus is one of the giants of Dutch literature, and it makes excellent sense to make this book available to a reading public.

To this end, someone took the djvu file from the Gutenberg project and started the proof reading process at the Dutch Wikisource. This process is still ongoing, I did two pages and I regret it.

The regret is because the book has already been proofread at project Gutenberg. As the book is in the public domain, the proofread transcription is also in the public domain.

What is the point ? Why not do another book ?

The transcription, and possibly the editing of the layout of the book, serves one aim: making it available to readers. It makes sense to do the proofreading once. There are many other books that are waiting to be digitised for the first time. There are plenty of other sources that are waiting, waiting in a library, a museum, an archive.

Let us not waste our efforts. Let us do things once and let us do them well. When we are done with the proofreading and the formatting, we need to find a public appreciative of the work done. To make it attractive it helps when there is a lot to choose from and when it is available in the format expected by our intended public.