Content provided by Freddy Guime and Bob Paulin. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Freddy Guime and Bob Paulin or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://fr.player.fm/legal.

Episode 104. It's all about Apache Tika, the project that lets you index EVERYTHING.

1:16:21
 

We continue to have guests on the show to talk to us about interesting things, and this time it's Apache Tika. Tika is an incredible tool for file processing and metadata extraction in support of search. Imagine you have tons of unstructured files, like emails or documents, and you want to extract their content, index it, and then search it. That is Tika's purpose. And who better to walk us through how it does its magic than its Project Management Committee (PMC) Chair, Tim Allison!

So take a listen as we go deeper on ingesting tons of content (which is fundamental for things like training LLMs).

http://www.javapubhouse.com/datadog We thank DataDogHQ for sponsoring this podcast episode

Don't forget to SUBSCRIBE to our cool NewsCast OffHeap! http://www.javaoffheap.com/

Apache Tika
* https://tika.apache.org/

OpenSearch Project and OpenSearch Neural Plugin Tutorials
* https://opensearch.org/
* https://opensearch.org/docs/latest/search-plugins/neural-search/
* https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/
* https://opster.com/guides/opensearch/opensearch-machine-learning/opensearch-hybrid-search/
* https://sease.io/2024/01/opensearch-knn-plugin-tutorial.html
* https://sease.io/2024/04/opensearch-neural-search-tutorial-hybrid-search.html

Selected Advanced File Processing toolkits/services
* https://unstructured.io/
* https://aws.amazon.com/textract/
* https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence

Selected Hybrid Search/RAG toolkits (there are _MANY_ others!)
* Haystack: https://haystack.deepset.ai/
* LangChain: https://www.langchain.com/
* LangStream: https://langstream.ai/

Search/Relevance Conferences
* https://haystackconf.com/
* https://2024.berlinbuzzwords.de/
* https://mices.co/

Tim's personal project
* JavaFX (ahem) tika-config writer UI: https://github.com/tballison/tika-gui-v2

Do you like the episodes? Want more? Help us out! Buy us a beer! https://www.javapubhouse.com/beer

And follow us! https://www.twitter.com/javapubhouse

108 episodes
