Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter
Dask
What is it?
Parallelism for analytics
What is parallelism?
Doing a lot at once by splitting a task into smaller subtasks that can be processed in parallel (at the same time)
Distribute work across multiple machines, then combine the results
Helpful for CPU-bound work - doing a bunch of calculations on the CPU. The rate at which the process progresses is limited by the speed of the CPU
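A minimal sketch of the split / process / combine idea, using only the Python standard library. The function names (sum_of_squares, split, parallel_sum_of_squares, demo) are made up for illustration and are not from the episode. The demo uses a thread pool to stay portable; for genuinely CPU-bound work in CPython, a ProcessPoolExecutor would be the usual substitute.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def sum_of_squares(chunk):
    """CPU-bound subtask: runs independently on one chunk."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, executor: Executor, n_chunks=4):
    """Split the task, run the subtasks on the executor, combine the results."""
    partials = executor.map(sum_of_squares, split(data, n_chunks))
    return sum(partials)


def demo():
    data = list(range(1000))
    # Threads keep the sketch portable; for real CPU-bound work in CPython,
    # a ProcessPoolExecutor (one process per core) is the usual swap.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return parallel_sum_of_squares(data, pool)
```

Because the combine step is a plain sum of partial results, it does not matter which worker handled which chunk.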
Concurrency?
Similar, but things don’t have to happen at the same time; they can happen asynchronously and overlap.
Shared state
Helpful for I/O-bound work - networking, reading from disk, etc. The rate at which a process progresses is limited by the speed of the I/O subsystem.
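A small sketch of overlapping I/O waits with asyncio. The fake_io coroutine is a hypothetical stand-in that only sleeps; a real program would await a network or disk call there instead.

```python
import asyncio


async def fake_io(name, delay):
    """Stand-in for a network or disk call; it only waits."""
    await asyncio.sleep(delay)
    return name


async def fetch_all():
    # All three waits overlap on a single thread, so the total time is
    # close to the longest delay rather than the sum of the delays.
    return await asyncio.gather(
        fake_io("a", 0.03),
        fake_io("b", 0.01),
        fake_io("c", 0.02),
    )


def run_demo():
    return asyncio.run(fetch_all())
```

Nothing here runs at the same instant on two cores; the tasks simply take turns while each one waits, which is why this style suits I/O-bound rather than CPU-bound work.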
Multi-core vs distributed
Multi-core: a single processor with 2 or more cores that can cooperate through threads (multithreading)
Distributed: multiple nodes communicating via HTTP or RPC
Why is this hard?
Python has its challenges due to the GIL (global interpreter lock), which lets only one thread run Python bytecode at a time; many other languages don't have this problem
Shared state can lead to potential race conditions, deadlocks, etc
Coordinate work across the machines
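To make the shared-state point concrete, here is a hedged sketch of several threads updating one counter. The Counter class and run_workers helper are illustrative names only. Without the lock, the read-modify-write in increment could interleave between threads and lose updates; with it, the final count is deterministic.

```python
import threading


class Counter:
    """Shared state guarded by a lock so increments are not lost."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write below must be atomic; the lock ensures
        # only one thread performs it at a time.
        with self._lock:
            self.value += 1


def run_workers(n_threads=4, n_increments=10_000):
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The lock fixes the race but also serializes the increments, which is one reason coordinating shared state tends to cost performance.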
For analytics?
Calculating some statistics on a large dataset can be tricky if it can’t fit in memory
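One way to picture the out-of-memory case is a mean computed chunk by chunk, keeping only a running total and count. This is a sketch under the assumption that the data can be read as a stream; read_in_chunks is a hypothetical loader, not an API from Dask or any other library.

```python
def chunked_mean(chunks):
    """Mean over an iterable of chunks, holding one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


def read_in_chunks(values, chunk_size):
    """Hypothetical loader: yields lists of up to chunk_size values."""
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Since each chunk contributes only a partial sum and a count, the chunks could also be handed to different workers and the partial results combined afterwards, which is the general shape of the parallel analytics discussed in these notes.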
// Show Notes
Coiled Cloud: https://cloud.coiled.io/
Coiled Launch Announcement: https://medium.com/coiled-hq/coiled-dask-for-everyone-everywhere-376f5de0eff4
OSS article: https://www.forbes.com/sites/glennsolomon/2020/09/15/monetizing-open-source-business-models-that-generate-billions/#2862e47234fd
Amish barn raising: https://www.youtube.com/watch?v=y1CPO4R8o5M
MessagePassingInterface: https://en.wikipedia.org/wiki/Message_Passing_Interface
----------- Connect With Us ✌️-------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Matthew on LinkedIn: https://www.linkedin.com/in/matthew-rocklin-461b4323/
Timestamps:
0:00 - Intro to Matthew Rocklin and Hugo Bowne-Anderson
0:37 - Matthew Rocklin's Background
1:17 - Hugo Bowne-Anderson's Background
3:47 - Where did that inspiration come from?
10:04 - Is there a close relationship between Best Practices and Tooling, or are these two separate things?
11:27 - Why is Data Literacy important with Coiled?
14:46 - How do you think about the balance between enabling Data Science to have a lot of powerful compute?
17:05 - Machine Learning as a space for tracking best practices experimentation
19:32 - What makes Data Science so difficult?
24:07 - How can a for-profit company complement Open Source Software (OSS)?
29:40 - Amazon becoming a competitor with your own open-source technology (?)
32:50 - How do you encourage more people to contribute and ensure quality?
34:58 - Do you see Coiled operating within the Dask ecosystem?
37:30 - What is Dask?
39:19 - What should people know about parallelism?
41:28 - Why is it so hard to put things back together?
41:34 - Why does Python need a whole new tool to enable that? Or maybe some other tools as well?
44:44 - Dynamic Task Scheduling as being useful to Data Scientists
47:15 - Why is reliability in particular important in Data Science?
52:27 - What's in store for Dask?
