R&D – NeuroSYS: AI & Custom Software Development

Test post for Zapier

admin — Tue, 28 Mar 2023 13:28:05 +0000

In the previous article on Elasticsearch, we’ve laid out essential facts about the engine’s mechanism and its components. This time, we would like to share some ideas you may come across or want to experiment with to boost your search performance. For the purposes of this article, we’ve conducted some experiments to illustrate the concepts and provide you with a ready-to-use code. Furthermore, based on our previous research experiences in the NLP domain, we will also try to explain what we found most beneficial when optimizing a search engine.

Experiments setup

To demonstrate the optimization ideas better, we have prepared two Information Retrieval datasets.

The first one is based on SQUAD 2.0 – a Question Answering benchmark that can be used for validation of the Information Retrieval (IR) as well, when adjusted properly. We’ve extracted 10.000 random documents and 1.000 of related questions. We treat each SQUAD paragraph as a separate index document.

From AI Workshop to Working Solution

Check Real Stories of Driving AI Innovations

Email address *

It is worth mentioning that Elasticsearch is designed by default for a much larger collection (millions of documents). However, we found the limited SQUAD version faster to compute and well-generalizing.

The second benchmark is our custom-prepared and much smaller. We used it to show how Elasticsearch behaves on smaller indices and how the same query types act differently on other datasets. The dataset is based on Stanford University lectures transcription of the SWIFT UI course, which can be found here. We’ve split the first seven lectures into 185 smaller ones, 25-sentenced documents with five sentences overlapping. We have also prepared 184 questions with multiple possible answers.

SQUAD paragraphs come from Wikipedia, so the text is concise and well written, and is not likely to contain errors. Meanwhile, the SWIFT UI benchmark consists of texts from recorded speech samples – it is more vivid, less concrete, but still grammatically correct. Moreover, it is rich in technical, software engineering-oriented vocabulary.

Comparison of two custom benchmarks used for purposes of this article

For validation of the Information Retrieval task, usually the MRR (mean reciprocal rank) or MAP (mean average precision) are used. We also use them on a daily basis; however, for the purpose of this article, to simplify the interpretation of outcomes, we have chosen the ones which are much more straightforward – the ratio of answered questions within top N hits: hits@10, hits@5, hits@3, hits@1. For implementation details see our NeuroSYS GitHub repository, where you can find other articles, and our MAGDA library.

Idea 1: Check the impact of analyzers on IR performance

As described in the previous article, we can use a multitude of different analyzers to perform standard NLP preprocessing operations on indexing texts. As you can probably recall, analyzers are by and large a combination of tokenizers and filters and are used for storing terms in the index in an optimally searchable form. Hence, experimenting with filters and tokenizers should probably be the first step you should take towards optimizing your engine’s performance.

To confirm the above statement, we present validation results of applying different analyzers to the limited SQUAD documents. Depending on the operations performed, the effectiveness of the search varies significantly.

We provide the results of experiments carried out using around 50 analyzers on the limited SQUAD sorted by hits@10. The table is collapsed for readability purposes; however, feel free to take a look at the full results and code on our GitHub.

Performance on limited SQUAD dataset with the use of different filters and tokenizers. Full table can be found on GitHub

Based on our observations of multiple datasets, we present the following conclusions about analyzers, which, we hope, will be helpful during your optimization process. Please bear in mind that these tips may not apply to all language domains, but we still highly recommend trying them out by yourselves on your datasets. Here is what we came up with:

stemming provides significant improvements,
stopwords removal doesn’t improve performance too much,
the standard tokenizer usually performs best, while the whitespace tokenizer does the worst,
usage of shingles (word n-grams) does not provide much improvement, while char n-grams can even decrease the performance,
simultaneous stemming while keeping the original words in the same field performs worse than bare stemming,
Porter stemming outperforms Lovins algorithm.

It is also worth noting that the default standard analyzer, which consists of a standard tokenizer, lowercase, and stop-words filters, usually works quite well as it is. Nevertheless, we were frequently able to outperform it on multiple datasets by experimenting with other operations.

Idea 2: Dig deeper into the scoring mechanism

As we know, Elasticsearch uses Lucene indices for sharding, which works in favor of time efficiency, but can also give you a headache if you are not aware of it. One of the surprises is that Elasticsearch carries out score calculation separately for each shard. It might affect the search performance, if too many shards are used. In consequence, the results can turn out to be non-deterministic between indexations.

Inverse Document Frequency is an integral part of BM25 and is calculated for each term, while putting documents into separate buckets. Therefore, the search score may differ more for particular terms, the more shards we have.

Nevertheless, it is possible to force Elasticsearch to calculate the BM25 score for all shards together, treating them as if they were a single, big index. However, it affects the search time greatly. If you don’t care about the search time but about the consistency/reproducibility, consider using Distributed Frequency Search. It will sum up all BM25 factors, regardless of the number of shards.

We have presented the accuracy of the Information Retrieval task in the below table. Note: It was our intention to focus on accuracy of the results and not on how fast we’ve managed to acquire them.

SWIFT UI – the impact of shards number and Distributed Frequency Search

It can be clearly seen that the accuracy fluctuates when changing the number of shards. It can also be noted that the number of shards does not affect the scores when using DFS.

However, with a dataset large enough, the impact of shards will be less. The more documents in an index, the more IDF parts of BM25 become normalized throughout shards.

Impact of shards number for different index sizes

In the table above, you can observe that the impact of the shards (a relative difference between DFS and non-DFS scores) is lower the more documents are indexed. Hence, the problem is less painful when working with more extensive collections of texts. However, in such a case, it is more probable that we would require more shards due to time performance. When it comes to smaller indices, we recommend setting the shards’ number to the default value of one and not worrying too much about the shards effect too much.

Idea 3: Check the impact of different scoring functions

BM25 is a well-established scoring algorithm that performs great in many cases. However, if you would like to try out other algorithms and see how well they do in your language domain, Elasticsearch allows you to choose from a couple of implemented functions or to define your own if needed.

Even though we do not recommend starting optimization by changing the scoring algorithm, the possibility remains open. We would like to present results on SQUAD 10k with the use of the following functions:

Okapi BM25 (default),
DFR (divergence from randomness),
DFI (divergence from independence),
IB (information-based),
LM Dirichlet,
LM Jelinek Mercer,
a custom TFIDF implemented as a scripted similarity function.

Impact of similarity function on limited SQUAD dataset – sorted by hits@10

Impact of similarity function on SWIFT UI dataset – sorted by hits@10

As you can see in the case of the limited SQUAD, the BM25 turned out to be the best-performing scoring function. However, when it comes to SWIFT UI, slightly better results can be obtained using the alternative similarity scores, depending on the metric we care about.

Idea 4: Tune Okapi BM25 parameters

Staying on the scoring topic, there are a couple of parameters the values of which can be changed within the BM25 algorithm. However, as in the case of choosing other scoring functions, we again do not recommend changing the parameters as the first steps of optimization.

The default values for parameters are:

b – term frequency normalization coefficient based on the document length,
k1 – term frequency non-linear normalization coefficient

They usually perform best across multiple benchmarks, which we’ve confirmed as well in our tests on SQUAD.

Keep in mind that despite the defaults being considered most universal, it doesn’t mean you should ignore other options. For example, in the case of the SWIFT UI dataset, other values performed better by 2% on the top 10 hits.

Impact of BM25 parameters on SWIFT UI sorted by hits@10 – a complete table to be found on GitHub

In this case, the default parameters turned out to be again the best for SQUAD, while SWIFT UI would benefit more from other ones.

Idea 5: Add extra data to your index with custom filters

As already mentioned, there are plenty of options in NLP, which text can be enriched with. We would like to show you what happens when we decide to add synonyms or other word derivatives like phonemes.

For the implementation details, we once again encourage you to have a glimpse at our repository.

Synonyms

Wondering how to make our documents more verbose or easier to query, we may try to extend the available wording used for document descriptions. However, this must be done with great care. Blindly adding more words to documents may lead to loss of their meaning, especially when it comes to the longer texts.

Automatic – WordNet synonyms

It is possible to automatically extend our inverted index with additional words, using synonyms from the WordNet synsets. Elasticsearch has a built-in synonyms filter that allows for an easy integration.

Below, we’ve presented search results on both SQUAD and SWIFT UI datasets with and without the use of all available synonyms.

SQUAD – the impact of using all WordNet synonyms

SWIFT UI – the impact of using all WordNet synonyms

As can be seen, using automatic, blindly added synonyms reduced the performance drastically. With thousands of additional words, documents’ representations get overpopulated; thus they lose their original meaning. Those redundant synonyms may not only fail to improve documents’ descriptiveness, but may also harm already meaningful texts.

The impact of using the WordNet synonyms analyzer on terms count

The number of terms in the SWIFT UI dataset has more than tripled when synonyms were used. It brings very negative consequences for the BM25 algorithm. Remember that the algorithm penalizes lengthy texts, hence documents that were previously short and descriptive may now be significantly lower on your search results page.

Meaningful synonyms

Of course, using synonyms may not always be a poor idea, but it might require some actual manual work.

Firstly, using spaCy, we’ve extracted 50 different Named Entities from the Swift programming language domain used in the SWIFT UI dataset.
Secondly, we’ve found synonyms for them, manually. As our simulation does not require usage of actual, existing words we have simply used random ones as business entities’ substitutes.
Finally, we have replaced occurrences of the Named Entities in questions with the selected word equivalents from the previous step, and added a list of the synonyms to the index with the synonym_analyzer.

Our intention was to create a simulation with certain business entities to which one can refer in search queries in many different ways. Below you can see the results.

Performance impact of using synonyms of business entities

Search performance improves with the use of manually added synonyms. Even though the experiment was carried out on a not too large sample, we hope that it illustrates the concept well – you can benefit from adding some meaningful words’ equivalents if you have proper domain knowledge. The process is time-consuming, and can hardly be automated; however, we believe it to be often worth the invested time and effort.

Impact of phonemes

It should be noted that, when working with ASR (automatic speech recognition) transcriptions, many words can be recognized incorrectly. They are often subject to numerous errors in transcription since some phrases and words sound alike. It might also happen that non-native speakers may mispronounce the words. For example:

To use a phonetic tokenizer a special plugin must be installed in the Elasticsearch node.

The sentence “Tom Hanks is a good actor as he loves playing” is represented as:

[‘TM’, ‘HNKS’, ‘IS’, ‘A’, ‘KT’, ‘AKTR’, ‘AS’, ‘H’, ‘LFS’, ‘PLYN’], when using Metaphone phonetic tokenizer,

and

[‘TM’, ‘tom’, ‘HNKS’, ‘hanks’, ‘IS’, ‘is’, ‘A’, ‘a’, ‘KT’, ‘good’, ‘AKTR’, ‘actor’, ‘AS’, ‘as’, ‘H’, ‘he’, ‘LFS’, ‘loves’, ‘PLYN’, ‘playing’], when using both phonemes and original words simultaneously.

Results of phonemes filter usage on SQUAD sorted by hits@10 – a complete table can be found on GitHub

We’ve come to the conclusion that using phonemes instead of the original text in the case of high-quality, non-ASR datasets like SQUAD does not yield much of an improvement. However, indexing phonemes and the original text in separate fields, and searching by both of them, slightly increased the performance. In the case of SWIFT UI the quality of transcriptions is surprisingly good, although the text comes from ASR. Therefore, the phonetic tokenizer is not applicable here as well.

Note: It might be a good idea to use phonetic tokenizers when working with more corrupted transcriptions, when the text is prone to typos and errors.

Idea 6: Add extra fields to your index

You might come up with the idea of putting additional fields to the index and expect them to boost the search performance. In Data Science it’s called feature engineering, or an ability to derive and create more valuable and informative features from available attributes. So, why not try deriving new features from text and index them in parallel as separate fields?

In this little experiment, we wanted to prove whether the above idea makes sense in Elasticsearch, and how to achieve it. We’ve tested it by:

extracting Named Entities using Transformer-based deep learning models from Huggingface ,
getting keywords by using the KeyBERT model,
adding lemmas from SpaCy

Note: The named entities, as well as keywords, are the excerpts already existing in the text but were extracted to separate fields. In contrast, lemmas are additionally processed words; they provide more information than available in the original text.

Additional fields indexing impact on limited SQUAD performance. Results are sorted by hits@10, multi-match query with cross-fields sub strategy used. More results can be found on GitHub.

While we were conducting the experiments, we discovered that, in this case, keywords and NERs did not improve the IR performance. On the contrary, word lemmatization seemed to provide a significant boost.

As a side note, we have not compared the lemmatization with stemming in this experiment. It’s worth mentioning that lemmatization is usually much trickier and can perform slightly worse in relation to stemming. For English, stemming is usually enough; however, in the case of other languages cutting off the suffixes will not suffice.

Based on our experience, we can also say that indexing parts of the original text without modifications, and putting them into separate fields, doesn’t provide much improvement. In fact, BM25 does just fine with keywords or Named Entities left in the original text, and thanks to the algorithm’s formula, it knows which words are more important than others, so there is no need to index them separately.

In short, it seems that fields providing some extra information (such as text title) or containing additionally processed, meaningful phrases (like word lemmas) can improve search accuracy.

Idea 7: Optimize the query

Last but not least, there are numerous options for creating queries. Not only can we change the query type but also we can boost individual fields in an index. Next to analyzer usage, we highly recommend experimenting with this step, as it usually improves the results.

We have conducted a small experiment, in which we have tested the following types of Elastic multi-match queries: best_fields, most_fields, cross_fields, on fields:

text – original text,
title – the title of the document, only if provided,
keywords – taken from KeyBERT,
NERs – done via Huggingface Transformers,
lemmas – extracted by SpaCy,

Alongside, we have boosted each field from the default value of 1.0 to 2.0 with increments of 0.25.

Usage results of different multi-match query subtype and fields weighing on limited SQUAD sorted by hits@10 – a complete table can be found on GitHub

As it has been proven above, the results on SQUAD dataset, despite being limited, show that queries of cross_field type provided the best results. What should also be noted is that boosting the title field was a good choice, as in most cases, it already contained important and descriptive data about the whole document. We’ve also observed that boosting only the keywords or NER fields gives the worst results.

However, as often happens, there is nothing like one clear and universal choice. When experimenting with SWIFT UI, we’ve figured that the title field is less important in this case, as it is often missing or contains gibberish. Also, when it comes to the query type, while cross_fields usually appears at the top, there are plenty of best_fields queries with very similar performance. In both cases, most_fields queries are usually placed somewhere in the middle.

Keep in mind that it all will most likely come down to analysis per dataset, as each of them is different, and other rules may apply. Feel free to use our code, plug in your dataset and find out what works best for you.

Conclusion

Compared to deep learning Information Retrieval models, full-text search still performs pretty well in plenty of use cases. Elasticsearch is a great and popular tool, so you might be tempted to start using it right away. However, we encourage you to at least read up a bit upfront and then try to optimize your search performance. This way you will avoid falling into a wrong-usage-hole and the attempts to get out of it.

We highly recommend beginning with analyzers and query optimization. By utilizing ready-to-use NLP mechanisms in Elastic, you can significantly improve your search results. Only then, proceed further with more sophisticated or experimental ideas like scoring functions, synonyms or additional fields.

Remember, it is crucial to apply methods appropriate to the nature of your data and to use a reliable validation procedure, adapted to the given problem. In this subject, there is no “one size fits all” solution.

Elasticsearch – search optimization ideas 2

admin — Fri, 07 Oct 2022 09:39:45 +0000

Experiments setup

To demonstrate the optimization ideas better, we have prepared two Information Retrieval datasets.

The first one is based on SQUAD 2.0 – a Question Answering benchmark that can be used for validation of the Information Retrieval (IR) as well, when adjusted properly. We’ve extracted 10.000 random documents and 1.000 of related questions. We treat each SQUAD paragraph as a separate index document.

The second benchmark is our custom-prepared and much smaller. We used it to show how Elasticsearch behaves on smaller indices and how the same query types act differently on other datasets. The dataset is based on Stanford University lectures transcription of the SWIFT UI course, which can be found here. We’ve split the first seven lectures into 185 smaller ones, 25-sentenced documents with five sentences overlapping. We have also prepared 184 questions with multiple possible answers.

Comparison of two custom benchmarks used for purposes of this article

Idea 1: Check the impact of analyzers on IR performance

Performance on limited SQUAD dataset with the use of different filters and tokenizers. Full table can be found on GitHub

stemming provides significant improvements,
stopwords removal doesn’t improve performance too much,
the standard tokenizer usually performs best, while the whitespace tokenizer does the worst,
usage of shingles (word n-grams) does not provide much improvement, while char n-grams can even decrease the performance,
simultaneous stemming while keeping the original words in the same field performs worse than bare stemming,
Porter stemming outperforms Lovins algorithm.

Idea 2: Dig deeper into the scoring mechanism

SWIFT UI – the impact of shards number and Distributed Frequency Search

It can be clearly seen that the accuracy fluctuates when changing the number of shards. It can also be noted that the number of shards does not affect the scores when using DFS.

However, with a dataset large enough, the impact of shards will be less. The more documents in an index, the more IDF parts of BM25 become normalized throughout shards.

Impact of shards number for different index sizes

Idea 3: Check the impact of different scoring functions

Okapi BM25 (default),
DFR (divergence from randomness),
DFI (divergence from independence),
IB (information-based),
LM Dirichlet,
LM Jelinek Mercer,
a custom TFIDF implemented as a scripted similarity function.

Impact of similarity function on limited SQUAD dataset – sorted by hits@10

Impact of similarity function on SWIFT UI dataset – sorted by hits@10

Idea 4: Tune Okapi BM25 parameters

The default values for parameters are:

b – term frequency normalization coefficient based on the document length,
k1 – term frequency non-linear normalization coefficient

They usually perform best across multiple benchmarks, which we’ve confirmed as well in our tests on SQUAD.

Impact of BM25 parameters on SWIFT UI sorted by hits@10 – a complete table to be found on GitHub

In this case, the default parameters turned out to be again the best for SQUAD, while SWIFT UI would benefit more from other ones.

Idea 5: Add extra data to your index with custom filters

For the implementation details, we once again encourage you to have a glimpse at our repository.

Synonyms

Automatic – WordNet synonyms

Below, we’ve presented search results on both SQUAD and SWIFT UI datasets with and without the use of all available synonyms.

SQUAD – the impact of using all WordNet synonyms

SWIFT UI – the impact of using all WordNet synonyms

The impact of using the WordNet synonyms analyzer on terms count

Meaningful synonyms

Of course, using synonyms may not always be a poor idea, but it might require some actual manual work.

Firstly, using spaCy, we’ve extracted 50 different Named Entities from the Swift programming language domain used in the SWIFT UI dataset.
Secondly, we’ve found synonyms for them, manually. As our simulation does not require usage of actual, existing words we have simply used random ones as business entities’ substitutes.
Finally, we have replaced occurrences of the Named Entities in questions with the selected word equivalents from the previous step, and added a list of the synonyms to the index with the synonym_analyzer.

Our intention was to create a simulation with certain business entities to which one can refer in search queries in many different ways. Below you can see the results.

Performance impact of using synonyms of business entities

Impact of phonemes

To use a phonetic tokenizer a special plugin must be installed in the Elasticsearch node.

The sentence “Tom Hanks is a good actor as he loves playing” is represented as:

[‘TM’, ‘HNKS’, ‘IS’, ‘A’, ‘KT’, ‘AKTR’, ‘AS’, ‘H’, ‘LFS’, ‘PLYN’], when using Metaphone phonetic tokenizer,

and

[‘TM’, ‘tom’, ‘HNKS’, ‘hanks’, ‘IS’, ‘is’, ‘A’, ‘a’, ‘KT’, ‘good’, ‘AKTR’, ‘actor’, ‘AS’, ‘as’, ‘H’, ‘he’, ‘LFS’, ‘loves’, ‘PLYN’, ‘playing’], when using both phonemes and original words simultaneously.

Results of phonemes filter usage on SQUAD sorted by hits@10 – a complete table can be found on GitHub

Note: It might be a good idea to use phonetic tokenizers when working with more corrupted transcriptions, when the text is prone to typos and errors.

Idea 6: Add extra fields to your index

In this little experiment, we wanted to prove whether the above idea makes sense in Elasticsearch, and how to achieve it. We’ve tested it by:

extracting Named Entities using Transformer-based deep learning models from Huggingface ,
getting keywords by using the KeyBERT model,
adding lemmas from SpaCy

Additional fields indexing impact on limited SQUAD performance. Results are sorted by hits@10, multi-match query with cross-fields sub strategy used. More results can be found on GitHub.

Idea 7: Optimize the query

We have conducted a small experiment, in which we have tested the following types of Elastic multi-match queries: best_fields, most_fields, cross_fields, on fields:

text – original text,
title – the title of the document, only if provided,
keywords – taken from KeyBERT,
NERs – done via Huggingface Transformers,
lemmas – extracted by SpaCy,

Alongside, we have boosted each field from the default value of 1.0 to 2.0 with increments of 0.25.

Usage results of different multi-match query subtype and fields weighing on limited SQUAD sorted by hits@10 – a complete table can be found on GitHub

Conclusion

Photoneo 3D camera tests and review

Marta Dunajko — Mon, 27 Jun 2022 12:23:53 +0000

When we say we keep tabs on the latest technologies, we mean it. Especially when it comes to the research & development that strives to break new ground with artificial intelligence/machine learning solutions, every day. However, we need to feed AI/ML algorithms with data, such as photos of Petri dishes to classify bacterial colonies, so they can learn in time and become more and more precise.

To provide data on objects in professional environments, particularly in the context of our Nsflow platform, we often use devices that capture images. Thus, 3D cameras seem to be a natural forward step in the processes connected with industrial automation, for example, robot guidance, quality inspection (including true shape control and dimensioning), or predictive maintenance (based on, among others, wear assessment).

Thus, to know what’s what, in our research and development unit we have been testing a Photoneo MotionCam-3D M. Does it live up to all our expectations? What effects can you expect? Keep on reading.

What is a 3D camera?

But let us start off on the right foot – what exactly we’re talking about. A three-dimensional camera is designed to capture images that provide the perception of depth, which isn’t achievable using run-of-the-mill devices e.g. traditional cameras. Depending on the type, 3D cameras can use different technologies. For example, stereo vision cameras try to mimic human binocular vision (analogously artificial intelligence tries to mimic human behavior). There are also cameras that use infrared light to measure distance, using the information on how long the beam travels. Some types combine both cameras and infrared projectors.

Again, based on the type, 3D cameras can use multiple sensors to capture different points of view. These perspectives are then merged and converted into a single 3D image or video.

Use cases

As we’ve mentioned at the beginning, we focus on industrial use cases to improve our clients’ manufacturing (and not only) processes. Here, we identify four areas in which 3D cameras can play a crucial role, mainly providing robots with sight – while our AI algorithms grant understanding of what they see:

Robot guidance: picking and placing/loading and unloading (irregular shapes included), packaging, and assembly – adjusting movement automatically
Inspection: measuring (distances, angles, holes), verifying completeness, fault recognition, quality assurance
Recognition: sorting objects based on their shape, size, and other features
Engineering: three-dimensional model creation and reverse engineering

Why Photoneo?

The 3D camera market offers a bunch of solutions. We have decided to test Photoneo MotionCam-3D M that declares to be the world’s highest-resolution and highest-accuracy 3D camera, created to combine with machine learning solutions which are our cup of tea. MotionCam is available in five sizes, that differ in scanning range (up to 3 meters) and accuracy. It can be used in two modes – camera and scanner. It uses technology similar to stereo vision cameras – but instead of two cameras Photoneo uses a camera and a structured light projector.

Photoneo MotionCam-3D testing

Getting started

The camera can be used in two modes – camera (video for dynamic scenes) and scanner (for more precision). It uses a single Ethernet cable for both power and data transfer, so we don’t need to have an additional power cord. Usually, this power supply works in a way that either we have to use a special router that is able to power the device or we can use an PoE injector that plugs in between the camera and the computer/router that powers the device. Suffice to say, camera mounting wasn’t particularly strenuous.

The object scanned – our Nsflow Box for industrial process optimization

Scanning the object

Photoneo MotionCam-3D can scan objects in two modes – either the camera is still and an object with markers are in motion or the other way round – camera is moving and object remains still.

For the purpose of this article, we decided to use the device in the former mode, when the camera remains still. As an object we’ve picked an industrial PC that is a part of our Nsflow Box solution, the precise model of which might be useful for our clients to plan it in their production halls. What we observed is that scanning works best when the object is placed on a matte surface.

The system can bring together multiple measurements for the most precise scans. The output of the object scanning is delivered in the form of point clouds.

PS Are 3D scans safe?

Lasers are safe for inanimate objects however remember not to look directly at the laser source – it can severely damage your sight.

Reference points marker

For scanning, we used reference point markers delivered with the camera by its producer. They simplify the process of merging the scans. When the object is in motion, markers should remain still with respect to it to work as a reference.

Black&white

MotionCam-3D scans are black & white, there is no information on the object’s color. In the case of point clouds though, it isn’t that important. If such data will be needed, you can connect an external camera that will add the color layer to Photoneo scans. The RGB camera needs to be in a fixed position w.r.t. the scanner, and calibrated with the use of the provided markers.

June 30 update: A new thing, there’s a color version of MotionCam-3D available.

3D camera app

To collect point clouds we used Photoneo scanning software. The control panel enabled handling the device, configuring the sensor parameters, and visualizing the output at the same time. To combine the points, convert them to the model, and clean the output we used MeshLab.

MeshLab open-source software

MeshLab software to clean and edit your scans

MeshLab software – a closeup view of the image and tools available

The effects

We’re aware that we’ve selected a pretty demanding object to be scanned, particularly because of its ridgy top surface. The truth is that the final effect required some manual refinement/cleaning so it mapped the objects precisely. Still, as you can see in the pictures and videos provided, the digital 3D camera has scanned our object successfully and we’re satisfied with the final result.

We expect that working longer with the camera would allow us to get better and better at object positioning and scanning, fully automating the process in the end.

Now that we know how it works and what effects we can expect, we see plenty of potential ways to use 3D cameras with our Nsflow platform. That’s it for now, but we recommend you read about AI in production process optimization.

Measuring distance with sound

admin — Fri, 27 May 2022 12:19:58 +0000

Measuring distances between objects is a type of data commonly obtained in a variety of industries and endeavours. The most recent examples are autonomous vehicles (how far an object in our way is), automatic ship berthing, or industrial object positioning.

One of the most common ways to do it is by using ultrasonic sensors. The grounds for it are simple – the majority of objects and substances reflect sound waves. Thus, today, we’d like to tell you more about one of our smaller projects in the field of sound measurement developed within our research and development division which aimed to solve one specific problem found in the field of geriatrics.

The story

A while ago, a Germany-based company reached us with a thorny problem. They were conducting a study in retirement homes on preventing cognitive impairment (dementia, Alzheimer’s, etc.) with a combination of physical and brain exercises.

To do so, pensioners had to remember the order of stations they had to visit displayed on a tablet. Then, holding a special object (similar to a baton in relay races) they moved from station to station. Apart from working memory, what was measured was seniors’ physical activity, as they had to move between the stands.

So far, the researchers measured the distances between multiple stations with a tape measure which wasn’t effective and caused a sense of frustration as well. Thus, our task was to come up with a method that would not only automate the measurement process but also grant the precision required.

How to measure the distance

The use of sound wasn’t an obvious choice at first. We’ve been considering – and testing – different methods before reaching a verdict, such as:

laser distance metre (didn’t pass our verification – no automation)
beacons (passed the initial verification but measurements lacked accuracy)
sound waves (passed the initial verification)
computer vision (we assumed that it might work but the users would have to mark each stand and attach a camera to it – no automation again).

As you can see, only sound waves passed the initial verification and seemed promising. However, as with any method, it wasn’t perfect, but we’ll delve into it in the next sections.

The measuring method

To measure distance with sound we have to use a transmitter that sends a short sound and a receiver that catches it on its return. The sound travels from the transmitter, bounces from an obstacle on its way, and returns to the receiver. Knowing the speed of sound, in this case travelling in the air, we can easily calculate the distance it covers. The devices available in the market oftentimes have both transmitter and receiver built-in.

However, using one device only affects accuracy and the range we can measure. Thus, we’ve decided to use two separate devices. They were transmitting the signal one by one and recording the entire process.

The use of two separate devices in measuring distance with sound

Signal design

The emitted signal had to be easy to be both produced and detected. Since we used mobile devices that are designed to work best at human voice frequencies, it determined the bandwidth to comply with. Our signal had also to be easily distinguishable from ambient noise. To do so, the system calculated the correlation with the pattern signal. The higher the correlation, the bigger chance the sound received is what we’re looking for.

Signal parameters:

type: linear chirp (increases in frequency linearly with time)
bandwidth: 2-4 kHz
duration: 100 ms

Measurement errors and their sources

No method is flawless though. The sources of error in this approach included:

unsynchronized clocks
signal sending delay (caused by the operating system)
signal receiving delay (caused by the operating system)
multipath effect (reflected signals).

Just a word of clarification regarding errors caused by the operating system. The delays mentioned above are a result of the lapse of time between the moment we issue a command that triggers the sound and the actual sound produced by the speakers. This time depends, among other things, on how heavily loaded the device is.

BeepBeep ranging

In this project, we’ve utilised an acoustic-based ranging system called BeepBeep. It has been developed with commercial off-the-shelf (COTS) devices, such as mobile phones, in mind. What was particularly important in our case, the solution is purely software-based and operates device-to-device with no extra infrastructure needed, which translates to lower costs.

The high accuracy is achieved through two-way sensing, self-recording, and calculating the distance measured by devices A and B, and B and A. BeepBeep helped us to eliminate three out of four sources of error described in the previous section, such as clock synchronisation and operating system delays. Although it didn’t resolve the problems connected with the multipath effect, BeepBeep allowed for its mitigation up to a certain distance, namely around 6 metres. Read more on the BeepBeep system.

Communication interface

When it comes to the communication interface, we’ve decided to make use of Google’s Nearby Connections. This peer-to-peer networking API allows applications to exchange data between nearby devices even when no internet connection is available. The transfer is fully encrypted and thus secure.

We used it to send two types of messages, namely:

start
elapsed time between the two times-of-arrival

The sequence diagram in measuring distance

The results

We have to admit, the results were fairly surprising, especially after the fiasco with beacon testing. Up to the distance of 6 metres (20 feet) the system achieved precise calculations, where the margin of error was less than 5 centimetres (2 inches). The systematic error results from the wrong values set for the distances between the device microphones and the speakers. In the table, you can see the exact results of our measurements. The major errors occurred only when we placed obstacles between devices.

The results of our measurements

The main advantage of measuring distance with sound is that the method is pretty accurate and at the same time cost-effective – there are plenty of device choices in the market. What is more, our measurement accuracy is not affected by the lighting conditions or the object features, such as colour or transparency. On the other hand, the distance covered accurately, up to 6 metres in our case, might not be enough in every project.

Generation of a synthetic microbial dataset with deep learning style transfer

admin — Fri, 06 May 2022 08:43:03 +0000

Deep learning models achieve considerably higher accuracy than traditional computer vision algorithms. When working on image recognition using traditional methods, feature extraction algorithms are tuned by hand which in many cases is a time-consuming procedure. On the contrary, in deep convolutional networks feature engineering is performed automatically – the network learns how to extract the best feature maps on its own and optimizes kernels in subsequent convolutions to preserve only relevant information from images.

So now we don’t need to spend weeks seeking optimal parameters. But it comes with a price. To get adequate results with a complex deep learning model we need large enough datasets. Collecting and annotating big datasets requires a lot of time and financial resources. Moreover, the labeling process itself can be challenging. Synthetic data is a promising alternative to deal with the lack of large enough datasets and to reduce the resources and costs associated with collecting such data [1]. Moreover, it might help institutions to share knowledge, e.g. datasets in highly specialized areas, while protecting individual privacy.

Our goal was to identify microbial colonies on Petri dishes – a typical task in microbiology. The assignment happened to be tough even for trained professionals, because some colonies tend to agglomerate and overlap, thus becoming indistinguishable for non-experts. In this article, we will present an effective strategy to generate an annotated synthetic dataset of microbiological images, that we have already published in Nature Scientific Reports magazine [2]. A generated dataset is then used to train deep learning object detectors in a fully supervised fashion. The generator employs traditional computer vision algorithms together with a neural style transfer method for data augmentation. We show that the method is able to synthesize a dataset of realistic-looking images that can be used to train a neural network model capable of localizing, segmenting, and classifying five different microbial species. Our method requires significantly fewer resources to obtain a useful dataset than collecting and labeling a whole large set of real images with annotations.

We show that starting with only 100 real images, we can generate data to train a detector that achieves comparable results [3] to the same detector but trained on a real, several dozen times bigger microbial dataset [4] containing over 7k images.

Generating a synthetic dataset

Let us now present a detailed description of the method. The goal is to generate synthetic images with microbial colonies that will be later used to train deep learning detection and segmentation models. The pipeline is presented in Fig. 1. Note that the source code with the Python implementation of our generation framework is publicly available.

Fig. 1. A scheme for synthetic dataset generation pipeline. Microbial colonies are segmented from real images using traditional computer vision algorithms and then randomly arranged on fragments of an empty dish giving synthetic patches with precise annotations. To improve the realism of generated data, patches are then stylized using a neural style transfer method. Figure adapted from [2].

We start with labeled real images of Petri dishes and perform colony segmentation using traditional computer vision algorithms, including proper filtering, thresholding in CIELab color space, and energy-based segmentation – we use a powerful Chan-Vese algorithm. To get a balanced working dataset, we randomly select 20 images for each of the 5 microbial species (giving 100 images in total) from the higher-resolution subset of the recently introduced AGAR dataset [4].

In the second step, the segmented colonies and clusters of colonies are randomly arranged on fragments of an empty Petri dish (we call them patches). We select a random fragment of one out of 10 real empty dish images. We repeat this step a lot of times, placing subsequent clusters in random places, making sure they do not overlap. Simultaneously, we store the position of each colony from the cluster placed on the patch and its segmentation mask, creating a dictionary of annotations for that patch. We present examples of generated synthetic patches in Fig. 2.

Fig. 2. Examples of synthetic patches with microbial colonies before the stylization step. Figure adapted from [2].

As we can see in Fig. 2, in some situations colonies do not blend well with the background, and their color does not match the background color. To deal with this problem and improve the realism of generated data, in the third step we apply data augmentation using a neural style transfer method. We transfer the style to a given raw patch from one of the selected real images that serve as style carriers. We select 20 real fragments with significantly different lighting conditions to increase the diversity of the generated patches. The exemplary patches after the stylization step are presented in Fig. 3. We use a fast and effective deep learning stylization algorithm introduced in [5]. This method gives us the most realistic stylization of our raw microbial images without introducing any unwanted artifacts.

Fig. 3. The stylization of the generated microbial images. Five synthetic patches (left) are stylized using the style of five real images (top). Figure adapted from [2].

Training deep learning models

Using the method, we generated about 50k patches that were next stylized. The idea behind the conducted experiments was to train a neural network model using synthetic data to detect microbial colonies and then test its performance on real images with bacterial colonies on a Petri dish. We train popular R-CNN detectors using our synthetic dataset. The examples of the Cascade R-CNN [6] detector evaluation on real patches from the AGAR dataset are presented in Fig 4. The model performs quite well detecting microbial colonies of various sizes under different lighting conditions.

Automatic instance segmentation is a task useful in many biomedical applications. During the patch generation, we also store a segmentation mask at a pixel level for each colony. We used this additional information to train a deep learning instance segmentation model – Mask R-CNN [7] which extends the R-CNN detector that we have already trained. The segmentation results for real samples are also presented in Fig. 4. Obtained instance segmentations for different microbial colony types correctly reproduce the colony shapes.

Fig. 4. Examples of microbial colonies detection (marked by green bounding boxes) on real data – fragments of Petri dishes with different microbial species from the AGAR dataset. Results for instance segmentation are presented as color segmentation masks. Figure adapted from [2].

One of the main applications of object detection in microbiology is to automate the process of counting microbial colonies grown on Petri dishes. We verify the proposed method of synthetic dataset generation by comparing it with a standard approach where we collect a big real dataset and train the detector for colony identification and counting tasks.

We train the R-CNN detector (Cascade) on a 50k large dataset generated using 100 images from the higher-resolution AGAR subset and test microbial colonies counting in the same task as performed in [4]. Results are presented in Fig. 5 (right). It turns out that the detection precision and counting errors for the synthetic dataset are only slightly worse [3] than for the same detector but trained on the whole big dataset containing over 7k real images giving about 65k patches. It is also clear that introducing style transfer augmentation improves the detection quality greatly, and without the stylization step the results are rather poor – see results for the raw dataset, i.e. obtained without the stylization step, in Fig. 5 (left).

Fig. 5. Microbial colonies counting tested on real data for two different synthetic training datasets: raw – without stylization (left) and stylized (right). Stylization greatly improves the detection performance. In ideal detection, every black dot representing a single Petri dish image should lay on the y = x black line. Figures adapted from [2].

Summary

We introduced an effective strategy to generate an annotated synthetic dataset of microbiological images of Petri dishes that can be used to train deep learning models in a fully supervised fashion. By using traditional computer vision techniques complemented by a deep neural style transfer algorithm, we were able to build a microbial data generator supplied with only 100 real images. It requires much less effort and resources than collecting and labeling a large dataset containing thousands of real images.

We prove the usefulness of the method in microbe detection and segmentation, but we expect that being flexible and universal, it can also be applied in other domains of science and industry to detect various objects.

References

[1] https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data

[2] J. Pawłowski, S. Majchrowska, and T. Golan, Generation of microbial colonies dataset with deep learning style transfer, Scientific Reports 12, 5212 (2022).

https://doi.org/10.1038/s41598-022-09264-z

[3] Detection mAP = 0.416 (bigger is better), and counting MAE = 4.49 (smaller is better) metrics, compared with mAP = 0.520 and MAE = 4.31 obtained the same detector but trained using the AGAR dataset [4].

[4] S. Majchrowska, J. Pawłowski, G. Guła, T. Bonus, A. Hanas, A. Loch, A. Pawlak, J. Roszkowiak, T. Golan, and Z. Drulis-Kawa, AGAR a Microbial Colony Dataset for Deep Learning Detection (2021). Preprint available at arXiv [arXiv:2108.01234].

[5] Ming Li, Chunyang Ye, and Wei Li, High-resolution network for photorealistic style transfer (2019). Preprint available at [arXiv:1904.11617].

[6] Z. Cai, and N. Vasconcelos, Cascade R-CNN: Delving into high quality object detection, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6154–6162 (2018).

[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, IEEE International Conference on Computer Vision (ICCV), 2980–2988 (2017).

Project co-financed from European Union funds under the European Regional Development Funds as part of the Smart Growth Operational Programme.
Project implemented as part of the National Centre for Research and Development: Fast Track.

Intro to coreference resolution in NLP

p.kozlowski@dev.neurosys.com — Wed, 04 May 2022 22:21:00 +0000

Introduction

Natural language processing (NLP) refers to the communication between humans and machines. NLP is one of the most challenging branches of Artificial Intelligence mainly because our human language is full of exceptions and ambiguities which are hard for computers to learn. One way of making it easier for them is to get rid of any imprecise expressions that need a context to be clearly understood. A good example is pronouns (e.g. it, he, her) which can be replaced with specific nouns they are referring to.

But how about a real-world application?

While working on a Question Answering System for the LMS platform we’ve encountered several problems. Especially with sentence embeddings – vector representations of text. It happens that sometimes a sentence consists of many pronouns. Such embeddings often don’t reflect the original sentence correctly when sufficient context isn’t provided. In order to obtain richer embeddings, we’ve applied coreference resolution to our pipeline.

What is coreference resolution?

Coreference resolution (CR) is the task of finding all linguistic expressions (called mentions) in a given text that refer to the same real-world entity. After finding and grouping these mentions we can resolve them by replacing, as stated above, pronouns with noun phrases.

Coreference resolution is an exceptionally versatile tool and can be applied to a variety of NLP tasks such as text understanding, information extraction, machine translation, sentiment analysis, or document summarization. It is a great way to obtain unambiguous sentences which can be much more easily understood by computers.

Coreference vs. anaphora resolution

It should be noted that we refer to coreference resolution as to a general problem of finding and resolving references in the text. However, technically there are several kinds of references and their definitions are a matter of dispute.

The one case most distinguished from coreference resolution (CR) is anaphora resolution (AR). The relation of anaphora occurs in a text when one term refers to another determining the second’s one interpretation [3]. In the example below, we see that (1) and (2) directly refer to different real-world entities however they are used in the same context and our interpretation of (2) relies on (1). These mentions do not co-refer but are in the relation of anaphora.

Even though anaphora resolution is distinct from coreference resolution, in the vast majority of cases one equals the other. There are many more examples of such differences and various other kinds of references. However, CR has the broadest scope and covers the vast majority of cases. As we would like to simplify this topic, from now on we are going to assume that all types of relations between terms are coreferential.

Different types of references

Even if we assume that we can treat all kinds of references as a coreference, there are still many different forms of relations between terms that are worth noting. That’s because every kind can be treated differently and most classic natural language processing algorithms are designed to target only specific types of references. [1]

Anaphora and cataphora

These are the bread and butter of our topic. The main difference is that anaphora occurs in the sentence after the word it refers to and cataphora is found before it. The word occurring before an anaphora is called an antecedent and the one following a cataphora is a postcedent.

Split antecedents

It’s an anaphoric expression where the pronoun (2) refers to more than one antecedent (1).

Coreferring noun phrases

It’s also an anaphoric example of a situation in which the second noun phrase (2) is a reference to an earlier descriptive form of an expression (1).

Presuppositions / bound variable

Some argue whether presupposition can be classified as a coreference (or any other “reference”) resolution type. That’s because a pronoun (2) is not exactly referential – in the sense that we can’t replace it with the quantified expression (1). However, after all the pronoun is a variable that is bound by its antecedent [3].

Misleading pronominal references

There are also certain situations that can be misleading. It’s when there is no relationship between a pronoun and other words in the text and yet the pronoun is there. While creating a CR algorithm we need to pay special attention to those kinds of references so it’s good to know in what situations we come into contact with them.

Clefts

A cleft sentence is considered to be a complex expression which has a simpler, less deceptive substitution. It’s a case where the pronoun “it” is redundant and we can easily come up with a sentence that has the same meaning but doesn’t use the pronoun.

Pleonastic “it”

This type of reference is very common in English so it requires an emphasis. The pronoun “it” doesn’t refer to any other term but it is needed in the sentence in order to make up a grammatical expression.

Steps for coreference resolution by example

It’s always best to visualize an idea and provide a concrete example as opposed to just theorizing about a topic. What’s more, we’ll try to explain and give concrete examples of the most common terms, associated with coreference resolution that we may come across in articles and papers.

The first step in order to apply coreference resolution is to decide whether we would like to work with single words/tokens or spans.

But what exactly is a span? It’s most often the case that what we want to swap or what we are swapping for is not a single word but multiple adjacent tokens. Therefore span is a whole expression. Another name for it you may come across is a mention. They are often used interchangeably.

In most state of the art solutions, only spans are taken into consideration. It is so since spans carry more information within them, while single tokens may not convey any specific details on their own.

Step 1 – identify potential spans

The next step is to somehow combine the spans into groups.

As we can see in this great quote from J.R.R. Tolkien, there are several potential spans that could be grouped together. Here we have spans like “Sam” or “his” that have only a single token in them, but we also see the span “a white star” consisting of three consecutive words.

Combining items is referred to as clustering or grouping. It is, as its name suggests, a method of taking arbitrary objects and grouping them together into clusters/groups within which these items share a common theme. These can range from words in NLP, through movie categories on Netflix, to grouping food based on their nutritional values.

There are many ways one may group, but what’s important is things in the same group should possess similar properties and be as different as possible from other clusters.

Step 2 – group spans

Here the “property” we are looking for is the spans referring to the same real-world entity.

The resulting groups are [Sam, his, he, him] as well as [a white star, it]. Notice that “Sam” and “a white star” are marked as entities. This is a crucial step in coreference resolution. We need to not only identify similar spans but also determine which one of them is, often referred to as, the real-world entity.

There is no single definition of a real-world entity but we will simply define it as an arbitrary object that doesn’t need any extra context to clarify what it is, in our example: “Sam”, or “a white star”. On the other hand, “his” or “him” are not real-world entities, since they must be accompanied by additional background information.

Step 3 – replace pronouns with real-world entities

As we can see [his, he, him] and [it] have been replaced with the real-world entities, from the corresponding groups – “Sam” and “a white star” respectively. As a result, we obtained a text without any pronouns while still being valid grammatically and semantically.

Summary

The aim of Coreference Resolution is to find, group and then substitute any ambiguous expressions with real-world entities they are referring to.

We’ve discussed a difference between coreference and anaphora resolution as well as shown and explained a couple of common problems associated with them. We’ve also managed to walk through the typical process of CR using an example.

By doing so, sentences become self-contained and no additional context is needed for the computer to understand their meaning. It won’t always be the case where we have well-defined entities but more often than not coreference resolution will lead to information gain.

This is only the first article in the series concerning coreference resolution and natural language processing. In the next one, we will show the pros and cons of the biggest deep learning solutions that we’ve tested ourselves and finally decided to implement in our system.

References

[1]: Rhea Sukthanker, Soujanya Poria, Erik Cambria, Ramkumar Thirunavukarasu (July 2020) Anaphora and coreference resolution: A review https://arxiv.org/abs/1805.11824

[2]: Sharid Loaiciga, Liane Guillou, Christian Hardmeier (September 2017) What is it? Disambiguating the different readings of the pronoun ‘it’ https://www.aclweb.org/anthology/D17-1137/

[3]: Stanford lecture (CS224n) by Christopher Manning (2019) https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1162/handouts/cs224n-lecture10-coreference.pdf

[4]: Coreference Wikipedia https://en.wikipedia.org/wiki/Coreference

Recognition and counting of microorganisms on Petri dishes

admin — Wed, 13 Oct 2021 19:18:27 +0000

The manufacturing processes in pharmaceutical, cosmetic, or food industries are under strict policies and regulations, and obligate manufacturers to perform constant microbiological monitoring. This means thousands of samples, usually in the form of standard Petri dishes (with microbial cultures grown on agar medium), that have to be analysed and counted manually by experienced microbiologists. This is a time-consuming and error-prone process, which requires a trained professional. To avoid these issues, an automated method applied for that task would be very appreciated.

In this article we will present deep learning methods of analysing microbiological images, developed by the NeuroSYS Research team. The crucial thing in training machine learning models is to gain large, well constructed dataset. Thus we will utilize the AGAR dataset introduced in our previous post to train a model that counts and classifies bacterial colonies grown on Petri dishes based on their RGB images.

Do you know this clue? – a dumb algorithm with lots of data beats a clever one with smaller amounts of it.

Detection of microbial colonies

Ok, so let’s start with detecting microbes. Imagine that we have images of a Petri dish (this circle glass commonly used to keep growth medium for multiplying microbial cells in laboratories). Exemplary photos of such dishes in different setups of AGAR images are presented in the left column in Figure 1. The other 5 columns present fragments of photos (we call them patches) containing 5 different microbe types. Now it is easy to understand what microbe detection means. We simply have to determine the exact position and size of each microbe by marking it with a blue rectangle (we call it bounding box) as in Figure 1 and zoomed in Figure 2.

Figure 1. Different types of microbes (columns 2-6) grow inside a Petri dish (1st column) in different lighting setups (each row). Each blue bounding box was predicted by a trained detector.

It seems to be easy for trained professionals, but note that microbe colony edges may be blurred, colony itself very small (even few pixel large), or camera settings (e.g. focus or lighting) very inadequate (see for example lighting conditions in 3rd row in Figure 1). Moreover some colonies may overlap which makes decisions where one colony ends and another starts very challenging. That’s why it is really difficult to build an automatic system for microbial colony localization and classification.

Figure 2. Zoomed patches with predictions made by our detector. Note that colonies may have very different sizes: some colonies are very small, some big with blurred edges, and some may overlap — this makes detection really challenging.

To do so we have developed a deep learning model for microbes detection. Deep learning is a family of AI models that utilizes mainly artificial neural networks. Such modern approaches turn out to be extremely successful in many areas, for example in computer vision or machine translation. Deep learning object detectors (in our case detecting microbial colonies) are very complicated multistage models with hundreds of layers, each consisting of hundreds of neurons.

AI tries to solve tasks that are relatively easy for humans but extremely difficult to be programmable.

Here we adopt two-stage detectors from the Region-based Convolutional Neural Network (R-CNN) family [2,3], which are known to be slow but very precise (in comparison to single-stage detectors, e.g. famous YOLO [4]). See Figure 3 for a short explanation on how it works. For a more detailed explanation of various object detection algorithms see our previous blog post on this matter.

Figure 3. Two-stage architecture of the R-CNN detector. First stage (a) generates region proposals which are just smaller parts of the original image that we think could contain the objects we are searching for. In the second stage: (b) we take each region proposal and create a feature vector representing this area using a deep CNN, (c) and classify each of the proposals: is it relevant and if yes, what class of object does it contain. Figure adapted from [2].

Training the detector

After presenting the results of the microbe detection let’s check how the detector works. This contains a neural network supervised training process but not only: also data preprocessing and postprocessing is present in the training scheme in Figure 4. To train a deep learning model in a supervised manner we need a labeled dataset. As mentioned previously we use here AGAR dataset consisting of images of Petri dishes with labelled microbial colonies.

Characteristic feature of neural networks is that the model’s architecture strictly corresponds to the input size. When training (and evaluating) the network we are limited by available memory, thus we are not able to process the whole high resolution image at once, so we have to divide it into many patches. This process is not straightforward because during cutting into patches we have to ensure that a given colony appears in its entirety on at least one patch.

After that, we were prepared to train the detector (upper row in Figure 4). We selected 8 different models from the R-CNN family to make a comprehensive comparison. After the detectors were trained we tested them (lower pipeline in Figure 4) on photos (in fact on patches) unseen during the training to make sure that tests were done fairly. Note that the patches prepared for testing are simply cut off evenly—at this stage we cannot include information about where the bounding boxes lie.

Figure 4. Flowchart for supervised training and evaluating (testing) our neural network models of microbial colonies detector.

Detection and counting results

We have seen in Figures 1 and 2 that our models detect colonies quite well. But how to describe the performance of detection quantitatively? There are standard numbers (metrics) that we may calculate to describe performance of the model on a test set. One of them, most popular, is called Average Precision (AP) or mean Average Precision (mAP) in case of multiclass detection (for detailed definition see this post). AP and mAP results for two selected R-CNN models (Faster ResNet-50 and Cascade HRNet) evaluated on two subsets of AGAR dataset (higher- and lower-resolution) are presented in Figure 5 (table on the left).

Generally the higher AP value the more precise detection is – the predicted and true bounding boxes better fit to each other. Note, however, that the situation is a bit complicated here because we have different microbe types which means that in addition to finding colonies, detector also needs to classify them.

Different classes of microbes are being detected with different fidelites and this affects mAP as seen in Figure 5. For example small colonies albeit with sharp edges of S. aureus bacteria are detected and marked better (AP about 65%) than big but blurred colonies of P. aeruginosa (AP about 50%) that also tend to aggregate. It is also worth mentioning that our results seem to be excellent compared to reports done with the same architectures on the famous COCO dataset: 45% for Cascade R-CNN and 37% for Faster R-CNN [5].

The final task strictly related to detection of every colony on the Petri dish is counting. After detecting all the microbial colonies we sum them up and compare this number with the ground truth number of colonies for a given Petri dish. The results for counting by the same two models on the AGAR test subsets are presented in Figure 5 (plots on the right).

On the x-axis we have the ground truth number of colonies for different dishes, estimated by trained professionals, while on the y-axis we have the value predicted by our models – each pair (truth, predicted) is represented by a single black point on these plots. It is obvious that in case of ideal predictions all points should lie on the y = x curve represented by black line. Luckily, the vast majority of points lie near this curve – the models count quite well. Two additional blue curves mark +/- 10% counting error, and we may see that only some minority of points (especially higher populated dishes with more than 50 colonies) lay outside this area.

The average counting errors were measured by the mean absolute error (MAE), defined e.g. in this blog, and so called symmetric mean absolute percentage error (sMAPE), which measures accuracy based on percentage errors [6]. In general, sMAPE do not exceed 5% which is quite a reasonable result.

Figure 5. The quality of microbial colonies detection: on the left there are presented results of average precision metric that describe fidelity of detection itself, while on the right there is comparison of colonies counting—predicted vs. truth number of colonies (ideal predictions lay on y=x black curve).

Conclusions

In summary, in this article we present deep learning studies on recognition of microorganisms on Petri dishes. The selected R-CNN models perform very well in detecting microbial colonies. The detection is facilitated by the fact that the colonies have similar shapes and all species of microbes are well represented in the training data, proving the utility of the AGAR dataset. Moreover, the results obtained with base Faster R-CNN and more complex Cascade R-CNN do not differ much.

As discussed above, the detectors are more accurate for samples with less than 50 colonies. However, they still give very good estimates for dishes with hundreds or even thousands of colonies, like these presented in Figure 6, correctly identifying single colonies in highly populated samples. In the extreme case, the maximum number of detected colonies on one plate was equal to 2782. It is worth noting that it took seconds for the deep learning system, while it could take up to an hour in the case of manual counting. Moreover in some situations the detectors were able to recognize colonies difficult to see and missed by humans. These cases confirm the benefits of building an automatic microbial detection system, and that this can be successfully achieved using modern deep learning techniques.

References

[1] P. Domingos, A few useful things to know about machine learning, Commun. ACM, vol. 55, pp. 78–87, 2012.

[2] R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2014.

[3] A. Mohan, Object Detection and Classification using R-CNNs, very detailed blog on RCNN models, 2018.

[4] J. Redmon et al., You only look once: Unified, real-time object detection, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

[5] J. Wang et al., Deep high-resolution representation learning for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[6] S. Majchrowska, J. Pawłowski, G. Guła, T. Bonus, A. Hanas, A. Loch, A. Pawlak, J. Roszkowiak, T. Golan, and Z. Drulis-Kawa, AGAR a Microbial Colony Dataset for Deep Learning Detection, 07 July 2021, Preprint available at arXiv [arXiv:2108.01234].

Visual Place Recognition – part 2

admin — Fri, 10 Sep 2021 13:04:03 +0000

At NeuroSYS we are currently working on user localization for our AR platform called nsFlow, which enables factory workers to do their duty by displaying instructions to them through smart glasses, so no prior training or supervision are needed. As we explained in our previous post, knowledge about the employee’s location is crucial to ensure proper guidance and safety in the factory.

To solve the user localization problem we utilized algorithms from the Visual Place Recognition (VPR) field.

In the first part of this series, we provided a general introduction to VPR. Today we would like to present the solution we came up with for nsFlow.

As stated in our last post, VPR is concerned with recognizing a place based on its visual features. The recognition process is typically broken down into 2 steps. First, a photo of the place of interest is taken and keypoints (regions that stand out in some way and are likely to be also found in other images of the same scene) are detected on it. Next, they are compared with keypoints identified on the reference image and if the 2 sets of keypoints are similar enough, the photos can be considered as representing the same spot. The first step is carried out by a feature detector and the second step is performed by a feature matcher.

But how can this be applied to user localization?

Since we didn’t need the exact location of the user and we only wanted to know in which room or at which workstation he/she was staying, the problem can be simplified to place recognition. To that end, we used algorithms belonging to the group of VPR. Specifically, we focused on Superpoint and Superglue, which are currently state-of-the-art in feature detection and matching. Additionally, we applied netVLAD for faster matching.

So much for the reminder from the last post. Now let’s move on to the most interesting part of this series, which is our solution.

Fig 1. The overall architecture of our VPR system. The green path shows how the initial ranking based on global descriptors is created. The red path represents the processing of local descriptors to get the final score for a query image.

Databases

As you can see in the graph above our system contains two databases:

an image database;
a room database.

The role of the first one is to store the image of each location (possible workstation or room), as well as several additional properties, namely:

a unique identifier;
images global descriptors;
images keypoints and local descriptors.

The room database associates each unique identifier with a room. A structure like this allows the system to be distributed between local machines (room database) and a computational server (image database), thus increasing the robustness and performance. Let’s now take a closer look at some of the image properties.

Keypoints detection and matching

As stated above, a VPR system needs a feature detector to identify keypoints and a feature matcher to compare them with the database and choose the most similar image. Each keypoint contains its (x, y)^T coordinates and a vector describing it (called the descriptor). The descriptor identifies the point and should be invariant to perspective, rotation, scale and lighting conditions. This allows us to find the same points on two different images of the same place (finding the pairs of keypoints is called matching).

In our case we used a deep neural network called SuperPoint to detect keypoints. We chose it over classical methods of computing features, because it is able to extract more universal information. The other advantage of selecting SuperPoint is the fact that it performs better in tandem with the feature matching deep neural network named SuperGlue compared to other keypoint extractors.

SuperGlue also shows improved robustness in comparison to classical feature matching algorithms. In order to use it we needed to implement the network from scratch based on this paper. This was a challenge in itself and might be a topic of a future article. With our implementation we achieved results similar to those from the paper. The image below exemplifies how our network performs.

Fig 2. An example of matched keypoints found by our VPR system.

Even though SuperPoint and SuperGlue work at around 11 FPS (2x NVIDIA GeForce RTX 2080 Ti), calculating the matches for all images from the database would be ineffective and introduce high latency in the localization system. To solve this problem we added one step before local feature matching allowing us to roughly estimate the similarity and further process only the frames that are the most promising. Here we introduce the concept of global descriptors and their matching.

Global Descriptors and matching

In order to roughly estimate the similarity between two images we use global descriptors. They take the form of a vector that uniquely identifies the scene in a global sense. Here are some properties that the global descriptor should have:

it should be invariant to the point of view – the same scene viewed from different perspectives should have global descriptors that are near each other in the vector space;
it should be invariant to lighting conditions – the same scene viewed at different times of the day and under different weather conditions should have similar global descriptors;
it should be insusceptible to temporary objects – the descriptor should not encode information about cars parked in front of the building or people walking by, but only information about the building itself.

In our case we used a deep neural network named NetVLAD to calculate the global descriptors. The network returns a vector that has all the aforementioned properties.

Similarly to brute-force local descriptor matching we calculate the distances between one descriptor and all others. Then we further process the images of the top N “most similar” (closest) descriptors. This process can be called global descriptor matching.

Combining all parts together

So far we have explained the basic concepts upon which our solution is built and introduced neural networks that we used. Now is the time to combine these blocks into one working system.

As mentioned previously there exist two databases: one associating each image’s identifier with a room (this for simplicity is called the room database) and one storing more complex information about the image (keypoints and global descriptors). In order to localize the user, a query with an image of the current view is sent to the localization system. The server first calculates necessary information about the new image – its global descriptor and keypoints. Next, it performs a rough estimation of the similarity by calculating the distances between the global descriptors of the query image and images in the database. Subsequently, N records corresponding to the shortest distances are chosen and processed further by SuperGlue, which compares keypoints detected on the query image with keypoints identified on N chosen images from the database. Finally, user location is determined based on the number of matching keypoints.

That’s all we wanted to show you about VPR and our user localization system. We hope you found it interesting. In the next and last part of this series we will present how our localization system works in practice. Feel free to leave comments below if you have any questions. Stay tuned to read on!

If you want to find out more about nsFlow, please visit our website.

Do you wish to talk about the product or discuss how your industry can benefit from our edge AI & AR platform? Don’t hesitate to contact us!

In AI We Trust (but should we really?)

admin — Mon, 09 Aug 2021 10:06:40 +0000

Recent advancements in Artificial Intelligence (AI) led us to a point where AI-based technologies surround us and assist us in many daily routines. Some people may not realize it but AI is already an inherent part of our lives. Just think of how much time you spend on your smartphone which is loaded with AI algorithms. Whenever you take a photo, browse your gallery, or enjoy augmented reality features, AI assists you. Maybe you just scroll through Facebook or Instagram, or look for a new TV show on Netflix or music on Spotify – it is all powered by AI-based recommender systems. And it is not only about leisure activities. AI is already an integral part of automation and robotics, surveillance, e-commerce, agriculture 4.0, and it is also finding its way to many other sectors, like healthcare, human resources, or banking. AI wins with humans in chess, Go, and video games. It even started to replace human work in some areas. We, as individuals and as a society, have become dependent on AI. With the recent progress in deploying machine learning (ML) models on edge devices, we can only expect more and more AI systems around us. Are we ready for this? Are we reaching the technological singularity?

General artificial intelligence

Stephen Hawking once said: “A superintelligent AI will be extremely good at accomplishing its goals, and if those goals aren’t aligned with ours, we’re in trouble”. Fortunately, we have not even reached the level of general artificial intelligence. While some people believe it is not going to happen soon (or ever), many science and technology leaders are not so sure about that. Deep Mind’s researchers have just published “Reward is enough” [1]. They hypothesize that reinforcement learning agents can develop general AI by interacting with rich environments with a simple reward system. In other words, an agent, being driven by one goal – maximizing its cumulative reward – can learn the whole range of skills, which include knowledge, learning, perception, social intelligence, language, and generalisation.

It is likely that general AI is coming sooner or later. Meanwhile, the number of narrow AI solutions is growing exponentially. We need some tools to ensure that AI serves us well. Governments have already started to work on new laws regulating AI. Just this April European Commission released the Artificial Intelligence Act [2] to regulate AI technologies across the European Union.

The ethics of AI

At the same time, there is an ongoing global debate on AI ethics. However, it is difficult to define ethical AI. We do not have a precise definition of what is right and wrong – it depends on the context and cultural differences come into play. There are no universal rules that could be implemented in ethical AI systems. Not to mention an extra layer of difficulty it would introduce to the process of building deep learning solutions, which is already a hard task. It is possible that defining ethical AI will progress iteratively over time and – since we cannot predict all possible failures and their consequences – we have to make mistakes and hopefully learn from them. In any case, there are some measures we – as the community of AI developers – can take to ensure the quality, fairness, and trustworthiness of our software.

The important aspects of ML systems development

In a data-driven AI systems development, it is extremely important to understand and prepare data used to build a model. This is a necessary step to properly choose a methodology and create a reliable system. Moreover, this is required to minimize the effect of passing past prejudices and bias hidden in data into the final AI system. While it is tempting to start a project with prototyping deep learning algorithms it is crucial to first fully understand the data and underlying problems. If necessary domain experts should be involved in the process. It is also important, as for any software application, that proper security measures are applied to protect the data.

The next step in data preparation is a choice of training, validation, and test sets. Cross-validation is a commonly used technique to split a dataset into these three groups, and to verify the ability of a model to generalize for new samples. It has to be noted that while this method is widely adopted it has one drawback – there is a silent assumption that data is independent and identically distributed (the i.i.d. assumption). In most real-world scenarios i.i.d. does not hold. It does not mean that a cross-validation method should not be used, but AI developers must be aware of that to properly predict possible failures.

There is another well known problem related to the i.i.d. assumption, called domain or distribution shift. In short, it means that a training dataset is drawn from different distributions than real data used to feed an AI system after deployment. For example, a model trained solely on stock images may not work when it is later applied to users’ photos (different lighting conditions, quality, etc.) or an autonomous car which is learned how to drive during the daytime may not be able to perform this flawlessly at night. It is important that AI developers take into account that their model may fail in real life even if it works perfectly “in the lab”. And if possible, to use one of domain adaptation techniques to minimize the effect of distribution shift.

The right choice of metric is also crucial for an AI system development. Commonly used accuracy may be a good option for some image classification tasks, but it will fail to correctly represent the quality of a model for an imbalanced dataset. In such cases, F-score (the harmonic mean of precision and recall) is preferred. MAPE (Mean Absolute Percentage Error) is often chosen for regression tasks. However, it penalizes negative errors (prediction higher than true value) more than positive ones. If this is not desirable, sMAPE (symmetric MAPE) can be used instead. AI developers have to understand the advantages and shortcomings of metrics to choose the one adequate for a problem being solved.

Fig 1. The number of AI publications in the last 20 years. (source: AI Index Report 2021 [3])

Finally, one has to select an appropriate model for the task. There are thousands of AI publications every month (Fig 1). A lot of algorithms are proposed and AI developers have to choose the right one for a particular problem. It is hard to read every paper in the field, but it is necessary to at least know the state-of-the-art (SOTA) models and understand their pros and cons. It is important to follow new trends and to be aware of all groundbreaking approaches. Sometimes it is a matter of a few weeks for a model to lose its SOTA status (Fig 2).

Fig 2. Top models for image classification on ImageNet. (source: Papers with Code [4])

Understanding AI methods

Many frameworks have been released to accelerate the ML development process. Libraries like PyTorch or TensorFlow allow for quick prototyping and experimenting with various models. They also provide tools to make the deployment easy. Recently, AutoML (Automated Machine Learning) services, which allow non-experts to play around with ML models, have gained popularity. This is definitely a step forward to spread ML-based solutions across many different fields. However, choosing the right methodology and its deep understanding is still crucial to make a reliable AI system.

Regardless of the tool used, all the above aspects of the ML development process have to be considered carefully. It is important to remember that AI is goal-oriented and it may be misleading about its real performance. The researcher from the University of Washington reported how the shortcuts learned by AI may trick us that it knows what it is doing. Multiple models were trained to detect COVID-19 in radiographs and they performed very well on the validation set, which was created from the data acquired in the same hospital as the training set. However, they completely failed when applied to X-rays coming from a different clinic. It turned out that the models learned to recognize irrelevant features, like text markers, rather than medical pathology (Fig 3).

Fig 3. Saliency maps indicating the regions with the greatest influence on the models’ predictions. Strong activations come from the parts of images which do not represent medical pathology. (source: AI for radiographic COVID-19 detection selects shortcuts over signal [5])

Towards reliable and trustworthy AI

On the one hand, machine learning is now more accessible for developers and many interesting AI applications arise. On the other hand, people’s trust in AI grows inversely to the number of unreliable AI systems. It may take years before international standards arrive to certify the quality of AI-based solutions. However, it is time to start thinking about this. Recently, TUV Austria in collaboration with the Institute of Machine Learning at Johannes Kepler University released a white paper on how AI and ML tools can be certified [6]. They proposed a catalog to be used for auditing ML systems. At the moment, the procedure is provided only for supervised learning with a low criticality level. The authors list necessary requirements an AI system must meet and propose a complete workflow of the certification process. It is a great starting point to extend this in the future for other ML applications.

AI is everywhere. At this point, it “decides” what movie you are going to watch this weekend. In the near future, it may “decide” if you get a mortgage or about your medical treatment. The AI community has to make sure that the systems they are developing are reliable and fair. The AI developers need to have a comprehensive understanding of the methods they apply and data they use. Necessary steps must be taken to prevent prejudice and discrimination from data to be passed to AI systems. Fortunately, many researchers are aware of that and put a lot of effort into developing adequate tools, like explainable AI, to help create AI we can trust.

References

[1] Silver, David, et al. “Reward is enough.” Artificial Intelligence (2021): 103535.

[2] Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE (ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION LEGISLATIVE ACTS

[3] Artificial Intelligence Index Report

[4] Papers with Code

[5] DeGrave, Alex J., Joseph D. Janizek, and Su-In Lee. “AI for radiographic COVID-19 detection selects shortcuts over signal.” Nature Machine Intelligence (2021): 1-10.

[6] Winter, Philip Matthias, et al. “Trusted Artificial Intelligence: Towards Certification of Machine Learning Applications.” arXiv preprint arXiv:2103.16910 (2021).

AGAR – Annotated Germs for Automated Recognition

admin — Tue, 27 Jul 2021 10:59:10 +0000

Deep learning (DL) algorithms are able to draw some conclusions by analyzing given data with or without provided annotations (for example in the form of labels). In the case of computer vision, the data is supposed to be a big pile of images, while the way they should be labeled depends on the problem to be solved. In this post, we will do a quick but broad in scope review of the database created during our studies, namely AGAR dataset, which stands for Annotated Germs for Automated Recognition database [1].

Why AGAR?

One of the challenges of modern microbiology is the automation of the process of recognizing species of microbes grown on an agar plate. Nowadays, the classification is done by analyzing specific morphological features such as shape, color, texture or size of microbial colonies grown on agar medium. This task requires specialist knowledge and often a lot of experience, as some microbes show similar characteristics. Moreover, the colonies of one species may vary in appearance depending on the time of breeding, growth medium or availability of nutrients. In recent years, the role of a specialist is increasingly being replaced by automatic image analysis. However, this is not easy as the dissimilarity between the different species of bacteria can be very subtle.

Preparing our own set of samples with photos of microbes was necessary because such a big and varied dataset did not exist in the publicly available collection. Although many researchers [2-5] have been concerned with the classification and counting of microorganisms, the number of provided images is definitely too small to optimize the parameters of the complicated neural network, such as object detector. In addition, mainly microscopic photos [2] or only small segments of agar plate culture [3-5] were made available.

Figure 1: Example photos of 5 microbial cultures on agar plates taken in 4 different setups with an illustration of its labeling.

It all starts with data

To enable an effective learning process of neural networks, it is important to have a similar number of samples for each type of microbes, as diverse as possible. Additionally, in case of supervised learning, all samples must be properly classified and described. For that purpose, an AGAR dataset is introduced.

Figure 2: Images of agar plates taken in 4 acquisition setups.

The AGAR is an image dataset for microorganism colonies detection and counting. This set contains 18k photos of Peri dishes taken by two cameras, under diverse lighting conditions, from very bright images to even more vague. We distinguish 4 acquisition setups as provided in Figure 2. The images are manually labeled by professional microbiologists using a bounding box and class tag to train and evaluate object detection algorithms. It contains in total 336 442 annotated colonies of five microorganisms (Staphylococcus aureus, Bacillus subtilis, Pseudomonas aeruginosa, Escherichia coli, and Candida albicans) most commonly used according to the Pharmacopoeia guidelines [6]. In addition, some defects (i.e. marks on agar surface) and contaminations (any unwanted microbiological contamination) were also labeled. AGAR database is not only photos of Petri dishes covered with countable (more than zero and less than 300 colonies) numbers of microbial colonies. The database also includes empty, and uncountable samples.

For plates with colonies number higher than 50, but less than 300, it is still possible to count by human, but colonies tend to agglomerate and are hard to distinguish from each other. The number of colonies higher than 300 was treated as uncountable — too high a number leads to counting errors due to overcrowding across the surface of the plate.

Figure 3: Examples of images of AGAR dataset with different numbers of colonies.

Exploratory data analysis

AGAR database contains several thousand samples for each type of bacteria, which additionally can be further extended by standard transformations such as rotation or mirroring. Exact knowledge of the distribution of instances for each microbial class might be helpful during the training and evaluation of the DL-based system. Sometimes it happens that misclassification or other errors are caused by the imbalance in the dataset or some misleading labels. Let’s try to extract from the AGAR dataset as much information as we could.

Especially the size of the bounding boxes is essential — it is a well-known fact that even the state-of-the-art detectors do not work well with small objects.

Figure 4: Proportion of images by different acquisition setup and countable groups.

A summary for our datasets showing the proportion of images grouped by background category is shown in Figure 4. Different shades for subgroups in the given background category indicates samples classified by experts as empty, countable, and uncountable. At the very beginning of the data collection process, we collected photos obtained with a higher resolution camera (bright + dark + vague, see Fig. 2). In this part we have chosen a low level of dilution to significantly reduce the number of empty samples, and therefore uncountable probes account for more than 33% of this part of our collection.

The number of instances per analyzed category is illustrated in Figure 5. AGAR dataset for combined bright and dark subgroups achieves a good balance in the number of instances of different microbe species, which is significantly helpful for learning a robust detector. It is a bit poorer in the case of a vague subgroup because two microbe classes are missing. Additionally, for this part of dataset, more than 80% of photos contain two different microorganisms, in contrast to the rest of the collection, where such images account for 15% (in case of photos from bright and dark subgroups) and 40% (in case of lower-resolution) of the total images of the subgroup. For the lower-resolution subgroup, there are clearly fewer instances for B.subtilis class. The most numerous microbe category is E.coli, and the least numerous group is B.subtilis.

Figure 5: The number of instances per category for all 5 microbe species and additional two classes – Defect and Contamination.

In case of the size variability of annotations per category for the whole dataset, we distinguished two bounding box size distributions: 0 – 128 px for C. albicans and S. aureus, and 16 – 512 px for P. aeruginosa, B.subtilis, and E. coli. In total (excluding defects and contamination), there are 154 630 bounding boxes with square root of area below 128px, 180 173 within 128 – 512px, and 1 639 above 512px. This wide range of sizes makes the detection task more challenging because models have to be flexible enough to handle the variety of instances’ dimensions.

Sum up

Nowadays, convolutional neural networks (CNN) are very successful in problems related to pattern recognition in images. The first attempts to use CNN in microbiology appear, however, they are not exhaustive. The creation of the huge and well-balanced AGAR database enables the design of a deep neural network for the detection and counting of microbial colonies grown on an agar substrate. Our main motivation was to prepare a universal model that can be successfully used in the analysis of various microbiological samples. The deep learning studies and their results will be described in our next post.

If you are interested in the full AGAR dataset for your research, you can find it on a dedicated page.

References

[1] Sylwia Majchrowska, Jarosław Pawłowski, Grzegorz Guła, Tomasz Bonus, Agata Hanas, Adam Loch, Agnieszka Pawlak, Justyna Roszkowiak, Tomasz Golan, and Zuzanna Drulis-Kawa. AGAR a Microbial Colony Dataset for Deep Learning Detection, 07 July 2021, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-668667/v1]

[2] Bartosz Zielinski, Anna Plichta, Krzysztof Misztal, Przemysław Spurek, Monika Brzychczy-Włoch, and Dorota Ochonska. Deep learning approach to bacterial colony classification. PloS One, 12(9), 2017.

[3] Alessandro Ferrari, and Alberto Signoroni. Multistage classification for bacterial colonies recognition on solid agar images. In 2014 IEEE International Conference on Imaging Systems and Techniques (IST) Proceedings, pages 101–106, 2014.

[4] Alessandro Ferrari, Stefano Lombardi, and Alberto Signoroni. Bacterial colony counting with convolutional neural networks in digital microbiology imaging. Pattern Recognition, 61:629 – 640, 2017.

[5] Mattia Savardi, Alessandro Ferrari, and Alberto Signoroni. Automatic hemolysis identification on aligned dual-lighting images of cultured blood agar plates. Computer Methods and Programs in Biomedicine, 156:13 – 24, 2018.

[6] European Pharmacopoeia, chapter 2.6.1 Sterility, page 155–158. Council of Europe, 6 edition, Ph. Eur. 6.0,01/2008:20601, 2007.