If you think this looks like too much assembly, many text analytics tools won’t work for you. For this reason, we’re focusing on tools that a normal business user can actually get up and running within a few minutes. We promise, you won’t need to compile source code or master complex algorithms. You may need to watch a few YouTube demos, but you were probably expecting that.

I’ve personally demoed the following solutions to test their ease of use. Many tools were demoed, but few were selected. Here are the ones we’ll cover.

RapidMiner + AYLIEN

Solution type: Text processing add-ons for open-source data mining platform
Deployment: Windows, Mac, Linux
Works for: Sentiment analysis, advanced text analytics
Pricing: Free (the free version processes up to 10,000 rows with a single logical processor; more advanced capabilities require a paid version)

RapidMiner is a free, open-source platform for data science, including text mining, predictive analytics and more.
The features of RapidMiner can be significantly enhanced with add-ons or extensions, many of which are also available for free. Thomas Ott, marketing data scientist at RapidMiner, explains, “The beauty of RapidMiner is that it’s visual programming: You don’t have to write the code, and you don’t have to know the math behind it.”

Among other extensions, the RapidMiner Marketplace offers a very functional and user-friendly add-on for sentiment analytics developed by third-party vendor AYLIEN. AYLIEN’s extension can automatically scrape data from Twitter (as can RapidMiner). It then analyzes tweets and scores them with a three-value sentiment scale: positive, negative or neutral.

In addition to reading from web sources such as Twitter, RapidMiner can also read directly from flat files, such as CSV and Excel files, or from databases. RapidMiner also offers its own extension for text analytics, which includes powerful text processing features that can be combined with advanced clustering algorithms and machine learning operators.

As Ott explains, “There are two main approaches to looking at text. One is doing a high-level overview: word counts, word frequency, where words occur in the corpus (the collection of documents being analyzed) etc.
The other is more heavy-duty, e.g., sentiment analysis and other techniques in which you train a machine-learning algorithm on a data set.”

Adding clustering algorithms to a text mining workflow in RapidMiner

As you can see in the above screenshot, adding advanced analytics to a basic text mining workflow in RapidMiner is as simple as dragging and dropping operators into the proper locations. Once this is done, it’s possible to output complex visualizations. For example, you can create a network showing the relationships between a specific term you want to focus on (such as a brand name) and other terms in the document you’re analyzing.

The following screenshot is an example of this kind of visualization. Ott applied clustering algorithms to Federal Reserve Bank meeting minutes to understand relationships between the currencies and concepts under discussion in the meetings.

“I’m a beer brewer, and did some Twitter analysis of the brands that people are talking about based on region. It turns out that people on the coasts talk about IPAs, people in the midwest talk about stouts, and people in the southwest talk about ales. This is a key thing for small businesses to look at.
Say, for instance, that I’m a Kia dealer, and I find out that people in Michigan like red cars and people in Montana like blue—I can then adjust my stock accordingly.”
Thomas Ott, marketing data scientist at RapidMiner

Takeaway: RapidMiner is the easiest to use and most fully featured text mining tool of the platforms I demoed. With the AYLIEN extension, you’ll be able to perform basic sentiment analysis within minutes of downloading and installing.

KNIME Analytics Platform

Solution type: Text processing add-ons for open-source data mining platform
Deployment: Windows, Mac, Linux (cloud version also available)
Works for: Advanced text analytics
Pricing: Free (the free version lacks advanced capabilities such as batch processing and template sharing, as well as advanced support options)

KNIME is another robust open-source data mining platform available in a free version with rich functionality. Like RapidMiner, KNIME offers an intuitive visual workflow builder for “programming-free” data mining.

Open Calais

Solution type: Dedicated tool for entity extraction and document tagging
Deployment: Cloud (on-premise version is paid)
Works for: Entity recognition
Pricing: Free (limited to 5,000 submissions per day, whereas paid options scale into the millions and offer more extensive sets of category fields for tagging)

Open Calais is a cloud-based content tagging tool offered by Thomson Reuters. Unlike RapidMiner and KNIME, it’s not a data mining suite with text mining extensions, and it doesn’t do sentiment analysis. Instead, it excels in the realm of entity recognition and extraction. You feed unstructured text into the tool, and it recognizes entities such as people, products and companies. It also recognizes relationships between entities and facts about entities.
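To make the idea of entity recognition concrete, here is a toy sketch in Python that simply treats runs of capitalized words as candidate entities. This is only an illustration of the concept; Open Calais and similar services rely on trained models, not a regex heuristic like this one.

```python
import re

def extract_entities(text):
    # Toy heuristic: runs of two or more capitalized words
    # (e.g. "Thomson Reuters") are treated as candidate entities.
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)

text = "Open Calais is a tagging tool offered by Thomson Reuters."
print(extract_entities(text))  # ['Open Calais', 'Thomson Reuters']
```

A heuristic this simple misses single-word entities and lowercase brand names, which is exactly the gap that statistically trained recognizers close.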
It even organizes entities into topics. Open Calais (now Refinitiv) can thus be used to quickly extract information from documents. This information can then be used to tag documents for classification. Some use cases for this functionality include:

- Tagging blog articles to improve navigation on a site.
- Tagging internal resources on a corporate intranet to help employees find them using search.
- Tagging knowledge base articles, academic archives etc.

Takeaway: Unlike RapidMiner and KNIME, Open Calais (now Refinitiv) won’t work for basic text processing or advanced sentiment analytics. It’s very good at recognizing entities for analysis of unstructured text, and is a robust tool for document tagging.

AntWordProfiler

Determining word frequency with AntWordProfiler

AntWordProfiler uses preloaded vocabulary and thesaurus lists, which can be edited by the user, to determine word frequency. Users can also load custom vocabulary lists into the tool. Results can then be saved in a text file formatted for easy viewing in Excel or another spreadsheet tool. There’s also a document viewer that highlights where terms in your vocabulary lists appear in the document.

Takeaway: AntWordProfiler can be used for quick counts of word frequency in complex, unstructured texts, as well as custom vocabulary profiling of unstructured texts. Unlike RapidMiner and KNIME, however, it’s not an end-to-end text mining solution.

Grab Bag: Even More Toys!

Here are a few other neat toys you should consider experimenting with:

Carrot2: A dedicated tool for applying clustering algorithms to documents.
There’s a web-based interface for applying some common clustering algorithms that can help with organizing documents into thematic categories. Carrot2 also integrates with the APIs of popular search engines in order to automatically cluster the results of keyword searches. It can thus be used in search engine optimization (SEO).

AYLIEN Google Sheets add-on: AYLIEN, the same company that develops the sentiment analytics extension for RapidMiner that we examined, also offers an add-on for doing sentiment analysis directly within Google Sheets. This is one of the easiest ways to score sentiment in a spreadsheet-style interface, but the number of API calls you can make per day with the free plan is limited.

National Centre for Text Mining/University of Manchester Sentiment Analysis: While still in beta, this tool is already quite functional in determining the overall sentiment in a single text (batch upload isn’t supported at this point). One nice feature is that the tool highlights positive and negative terms and chunks of text in different colors.

The Data Science Toolkit: A collection of easy-to-use, web-based text mining tools, including basic sentiment analysis. The sentiment analysis tool only supports analysis of short chunks of text at this point.
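The three-value scoring these sentiment tools produce can be illustrated with a minimal lexicon-based scorer. The word lists here are hypothetical stand-ins; production tools like AYLIEN use trained models rather than fixed lists.

```python
# Hypothetical sentiment lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def score_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = text.lower().split()
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"

print(score_sentiment("I love this excellent stout"))  # positive
print(score_sentiment("The service was terrible"))     # negative
```

Counting lexicon hits is roughly what the simplest web tools above do; the heavier machine-learning approaches learn which words (and word combinations) carry sentiment instead of relying on a fixed list.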
There are also lots of tools for geocoding text. For instance, you can translate street addresses to coordinates. These tools are also available via API calls for advanced use cases.

If you’re beginning to feel like the free stuff won’t work for you, and you’d like to explore the text mining features of paid BI solutions, you can browse the options on our site.

If there are additional open-source and free text mining tools you think we should list here, please just drop me an email at [email protected].
Localization plays a central role in the ability to customize an open source project to suit the needs of users around the world. Besides coding, language translation is one of the main ways people around the world contribute to and engage with open source projects.

There are tools specific to the language services industry (surprised to hear that’s a thing?) that enable a smooth localization process with a high level of quality. Categories that localization tools fall into include:

- Computer-assisted translation (CAT) tools
- Machine translation (MT) engines
- Translation management systems (TMS)
- Terminology management tools
- Localization automation tools

The proprietary versions of these tools can be quite expensive.
A single license for SDL Trados Studio (the leading CAT tool) can cost thousands of euros, and even then it is only useful for one individual, and the customizations are limited (and psst, they cost more, too). Open source projects looking to localize into many languages and streamline their localization processes will want to look at open source tools to save money and get the flexibility they need with customization. I’ve compiled this high-level survey of many of the open source localization tool projects out there to help you decide what to use.

Computer-assisted translation (CAT) tools

OmegaT CAT tool. Here you see the translation memory (Fuzzy Matches) and terminology recall (Glossary) features at work. OmegaT is licensed under the GNU Public License version 3+.

CAT tools are a staple of the language services industry.
As the name implies, CAT tools help translators perform the tasks of translation, bilingual review, and monolingual review as quickly as possible and with the highest possible consistency through reuse of translated content (also known as translation memory). Translation memory and terminology recall are two central features of CAT tools.
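A minimal sketch of how fuzzy translation-memory matching works, using Python's difflib for string similarity. The memory contents here are invented for illustration; real CAT tools use far larger memories and more sophisticated matching.

```python
import difflib

# Illustrative translation memory: previously translated segments.
MEMORY = {
    "Save your changes before closing.": "Guarde sus cambios antes de cerrar.",
    "Open a new file.": "Abra un archivo nuevo.",
}

def fuzzy_match(segment, threshold=0.75):
    """Return the stored (source, translation) pair most similar to the
    new segment, plus its score, or (None, 0.0) if nothing clears the
    match threshold."""
    best, best_score = None, 0.0
    for source, target in MEMORY.items():
        score = difflib.SequenceMatcher(None, segment, source).ratio()
        if score >= threshold and score > best_score:
            best, best_score = (source, target), score
    return best, best_score

match, score = fuzzy_match("Save all changes before closing.")
print(match)  # the "Save your changes..." pair, a high-similarity fuzzy match
```

A translator presented with that near-match only has to fix the one differing word instead of retranslating the whole segment, which is where the consistency and speed gains come from.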
They enable a translator to reuse previously translated content from old projects in new projects. This allows them to translate a high volume of words in a shorter amount of time while maintaining a high level of quality through terminology and style consistency. This is especially handy for localization, as text in a lot of software and web UIs is often the same across platforms and applications. CAT tools are standalone pieces of software, though, requiring the translators who use them to work locally and merge their work to a central repository.

Tools to check out:

Machine translation (MT) engines

MT engines automate the transfer of text from one language to another.
MT is broken up into three primary methodologies: rules-based, statistical, and neural (which is the new player). The most widespread MT methodology is statistical, which (in very brief terms) draws conclusions about the interconnectedness of a pair of languages by running statistical analyses over annotated bilingual corpus data. When a new source-language phrase is introduced to the engine for translation, it looks within its analyzed corpus data to find statistically relevant equivalents, which it produces in the target language.
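In toy form, the statistical approach boils down to a phrase table: source phrases mapped to target candidates with learned probabilities. The entries below are invented for illustration; a real engine learns millions of these from a bilingual corpus and also models word order.

```python
# Toy phrase table (illustrative data): each source phrase maps to
# target candidates with the probabilities a statistical engine
# would estimate from bilingual corpus data.
PHRASE_TABLE = {
    "good morning": [("buenos días", 0.9), ("buen día", 0.1)],
    "thank you":    [("gracias", 0.95), ("te agradezco", 0.05)],
}

def translate(phrase):
    """Pick the statistically most likely target-language equivalent."""
    candidates = PHRASE_TABLE.get(phrase.lower())
    if not candidates:
        # Unseen phrase: a real engine would back off to smaller n-grams.
        return None
    return max(candidates, key=lambda pair: pair[1])[0]

print(translate("Good morning"))  # buenos días
```

The "statistically relevant equivalent" the engine produces is simply the highest-probability candidate; ties and unseen phrases are where the real modeling work happens.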
MT can be useful as a productivity aid to translators, changing their primary task from translating a source text into a target text to post-editing the MT engine’s target-language output. I don’t recommend using raw MT output in localizations, but if your community is trained in the art of post-editing, MT can be a useful tool to help them make large volumes of contributions.

Tools to check out:

Translation management systems (TMS)
Mozilla's Pontoon translation management system user interface. With WYSIWYG editing, you can translate content in context and simultaneously perform translation and quality assurance. Pontoon is licensed under the BSD 3-clause New or Revised License.

TMS tools are web-based platforms that allow you to manage a localization project and enable translators and reviewers to do what they do best. Most TMS tools aim to automate many manual parts of the localization process by including version control system (VCS) integrations, cloud services integrations, and project reporting, as well as the standard translation memory and terminology recall features. These tools are most amenable to community localization or translation projects, as they allow large groups of translators and reviewers to contribute to a project. Some also use a WYSIWYG editor to give translators context for their translations. This added context improves translation accuracy and cuts down on the amount of time a translator has to wait between doing the translation and reviewing the translation within the user interface.

Tools to check out:

Terminology management tools
Brigham Young University's BaseTerm tool displays the new-term entry dialogue window. BaseTerm is licensed under the Eclipse Public License.

Terminology management tools give you a GUI to create terminology resources (known as termbases) to add context and ensure translation consistency. These resources are consumed by CAT tools and TMS platforms to aid translators in the process of translation. For languages in which a term could be either a noun or a verb depending on the context, terminology management tools allow you to add metadata for a term that labels its gender, part of speech, monolingual definition, and context clues. Terminology management is often an underserved, but no less important, part of the localization process. In both the open source and proprietary ecosystems, there are only a small handful of options available.

Tools to check out:

Localization automation tools
The Ratel and Rainbow components of the Okapi Framework. Photo courtesy of the Okapi Framework. The Okapi Framework is licensed under the Apache License version 2.0.

Localization automation tools facilitate the way you process localization data.
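As a small taste of what such processing looks like, here is a minimal sketch of one common automation task: pulling untranslated strings out of a gettext .po file. The parser below handles only simple single-line entries; frameworks like Okapi support many file formats and the edge cases this ignores.

```python
import re

def untranslated_entries(po_text):
    """Return msgids whose msgstr is empty, i.e. strings still needing
    translation. Minimal parser: single-line entries only."""
    entries = re.findall(r'msgid "(.*)"\nmsgstr "(.*)"', po_text)
    # Skip the header entry (empty msgid) and keep empty translations.
    return [msgid for msgid, msgstr in entries if msgid and not msgstr]

sample = '''msgid "Save"
msgstr "Guardar"

msgid "Cancel"
msgstr ""
'''
print(untranslated_entries(sample))  # ['Cancel']
```

Automating even this one check across a large project saves maintainers from eyeballing every catalog file for missing translations.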
This can include text extraction, file format conversion, tokenization, VCS synchronization, term extraction, pre-translation, and various quality checks over common localization standard file formats. In some tool suites, like the Okapi Framework, you can create automation pipelines for performing various localization tasks. These pipelines can be useful in a variety of situations, but their main utility is in the time they save by automating many tasks. They can also move you closer to a more continuous localization process.

Tools to check out:

Why open source is key

Localization is most powerful and effective when done in the open. These tools should give you and your communities the power to localize your projects into as many languages as humanly possible.

Want to learn more?
Check out these additional resources:

- list
- e-book

Jeff Beatty will be talking about this at an event held July 12-15 in Salt Lake City.

Jeff Beatty is the Head of Localization at Mozilla, the makers of the popular open source web browser, Firefox.
He holds an MSc in Multilingual Computing and Localisation from the University of Limerick. Jeff has also been featured as a localization expert in prominent global publications, such as The Economist, El Universal, Multilingual Magazine and others.
Jeff aims to showcase Mozilla's localization program, create disruptive, open source translation technology, and serve as an intermediary.

The opinions expressed on this website are those of each author, not of the author's employer or of Red Hat. Opensource.com aspires to publish all content under a Creative Commons license but may not be able to do so in all cases. You are responsible for ensuring that you have the necessary permission to reuse any work on this site. Red Hat and the Red Hat logo are trademarks of Red Hat, Inc., registered in the United States and other countries. Copyright ©2019 Red Hat, Inc.