staging: lustre: update the TODO list

As more people become involved with the progression of the lustre
client it needs to be more clear what needs to be done to leave
staging. Update the TODO list with the various bugs and changes
needed to accomplish this. Some are simple bugs and others are far
more complex tasks that will change many lines of code. Some even cover
updating the user land utilities to meet the kernel requirements.
Several bugs have already been addressed and just need to be
pushed to the staging tree.

Signed-off-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
James Simmons 2018-02-11 18:00:08 -05:00 committed by Greg Kroah-Hartman
parent 56cabde291
commit 1b168a467f

@@ -1,12 +1,302 @@
* Possible remaining coding style fix.
* Remove deadcode.
* Separate client/server functionality. Functions only used by server can be
removed from client.
* Clean up libcfs layer. Ideally we can remove include/linux/libcfs entirely.
* Clean up CLIO layer. Lustre client readahead/writeback control needs to better
suit kernel providings.
* Add documents in Documentation.
* Other minor misc cleanups...
Currently all the work directed toward the lustre upstream client is tracked
at the following link:
https://jira.hpdd.intel.com/browse/LU-9679
Under this ticket you will see the following work items that need to be
addressed:
******************************************************************************
* libcfs cleanup
*
* https://jira.hpdd.intel.com/browse/LU-9859
*
* Track all the cleanups and simplification of the libcfs module. Remove
* functions the kernel provides. Possibly integrate some of the functionality
* into the kernel proper.
*
******************************************************************************
https://jira.hpdd.intel.com/browse/LU-100086
LNET_MINOR conflicts with USERIO_MINOR
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8130
Fix and simplify libcfs hash handling
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8703
The current way we handle SMP is wrong. Platforms like ARM and KNL can have
core and NUMA setups with things like NUMA nodes with no cores. We need to
handle such cases. This work also greatly simplified the lustre SMP code.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9019
Replace libcfs time API with standard kernel APIs. Also migrate away from
jiffies. We found jiffies can vary on nodes which can lead to corner cases
that can break the file system due to nodes having inconsistent behavior.
So move to time64_t and ktime_t as much as possible.
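One way to read the jiffies concern above is that a timeout kept as a raw tick count means a different wall-clock duration on nodes built with different tick rates. The user-space sketch below illustrates that reading only; it is not kernel code, and `ticks_to_ns`, `ns_to_ticks` and the tick rates used are assumptions made up for the example.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Duration in nanoseconds represented by `ticks` at a tick rate of `hz`.
 * Assumes hz divides NSEC_PER_SEC evenly (true for 100, 250 and 1000).
 */
uint64_t ticks_to_ns(uint64_t ticks, uint32_t hz)
{
	return ticks * (NSEC_PER_SEC / hz);
}

/* Ticks needed to cover `ns` nanoseconds at tick rate `hz`, rounded up. */
uint64_t ns_to_ticks(uint64_t ns, uint32_t hz)
{
	uint64_t ns_per_tick = NSEC_PER_SEC / hz;

	return (ns + ns_per_tick - 1) / ns_per_tick;
}
```

Under these assumptions the same count of 1000 ticks is one second at 1000 ticks per second but ten seconds at 100 ticks per second, while a value carried in seconds or nanoseconds means the same thing on every node and is converted to ticks only locally.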
******************************************************************************
* Proper IB support for ko2iblnd
******************************************************************************
https://jira.hpdd.intel.com/browse/LU-9179
Poor performance for the ko2iblnd driver. This is related to many of the
patches below that are missing from the linux client.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9886
Crash in upstream kiblnd_handle_early_rxs()
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10394 / LU-10526 / LU-10089
Default to using MEM_REG
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10459
throttle tx based on queue depth
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9943
correct WR fast reg accounting
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10291
remove concurrent_sends tunable
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10213
calculate qp max_send_wrs properly
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9810
use fewer CQ entries for each connection
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10129 / LU-9180
rework map_on_demand behavior
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10129
query device capabilities
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10015
fix race at kiblnd_connect_peer
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9983
allow for discontiguous fragments
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9500
Don't Page Align remote_addr with FastReg
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9448
handle empty CPTs
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9507
Don't Assert On Reconnect with MultiQP
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9472
Fix FastReg map/unmap for MLX5
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9425
Turn on 2 sges by default
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8943
Enable Multiple OPA Endpoints between Nodes
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-5718
multiple sges for work request
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9094
kill timedout txs from ibp_tx_queue
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9094
reconnect peer for REJ_INVALID_SERVICE_ID
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8752
Stop MLX5 triggering a dump_cqe
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8874
Move ko2iblnd to latest RDMA changes
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8875 / LU-8874
Change to new RDMA done callback mechanism
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9164 / LU-8874
Incorporate RDMA map/unmap APIs into ko2iblnd
******************************************************************************
* sysfs/debugfs fixes
*
* https://jira.hpdd.intel.com/browse/LU-8066
*
* The original migration to sysfs was done in haste without properly working
* utilities to test the changes. This covers the work to restore the proper
* behavior. Huge project to make this right.
*
******************************************************************************
https://jira.hpdd.intel.com/browse/LU-9431
The function class_process_proc_param was used for our mass updates of proc
tunables. It didn't work with sysfs and it was just ugly so it was removed.
In the process the ability to mass update thousands of clients was lost. This
work restores this in a sane way.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9091
One of the major requests from users is the ability to pass parameters into a
sysfs file in various units. For example we can set max_pages_per_rpc,
but this can vary between platforms due to different page sizes. So you can
set this like max_pages_per_rpc=16MiB. The original code to handle this was
written before the string helpers were created, so the code doesn't follow that
format, but it would be easy to move to. Currently the string helpers do the
reverse of what we need, changing bytes to a string. We need to change a string
to bytes.
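The string-to-bytes direction described above can be sketched in user space as follows. This is a minimal illustration, not the Lustre or kernel implementation: `size_str_to_bytes` and `bytes_to_pages` are hypothetical names, and treating every prefix as a power of two is an assumption.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse strings such as "16MiB", "4K" or "1048576"
 * into a byte count. All prefixes are treated as powers of two.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
int size_str_to_bytes(const char *str, uint64_t *bytes)
{
	unsigned long long val;
	unsigned int shift = 0;
	char *end;

	if (!str || !isdigit((unsigned char)*str))
		return -EINVAL;

	errno = 0;
	val = strtoull(str, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	switch (*end) {
	case 'K': case 'k': shift = 10; break;
	case 'M': shift = 20; break;
	case 'G': shift = 30; break;
	case 'T': shift = 40; break;
	default: break;
	}
	if (shift) {
		end++;			/* skip the prefix letter */
		if (*end == 'i')	/* optional "i" as in "MiB" */
			end++;
	}
	if (*end == 'B')		/* optional trailing "B" */
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (shift && val > (UINT64_MAX >> shift))
		return -ERANGE;

	*bytes = (uint64_t)val << shift;
	return 0;
}

/* Convert a byte count to whole pages for a given page size. */
uint64_t bytes_to_pages(uint64_t bytes, uint64_t page_size)
{
	return bytes / page_size;
}
```

With these assumptions "16MiB" parses to 16777216 bytes, which is 4096 pages on a platform with 4 KiB pages and 256 pages on one with 64 KiB pages, so the same setting can be written once for both.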
******************************************************************************
* Proper user land to kernel space interface for Lustre
*
* https://jira.hpdd.intel.com/browse/LU-9680
*
******************************************************************************
https://jira.hpdd.intel.com/browse/LU-8915
Don't use linux list structure as user land arguments for lnet selftest.
This code is pretty poor quality and really needs to be reworked.
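A common alternative to passing linked-list structures across the user/kernel boundary is a flat, pointer-free layout: a count followed by an array of fixed-width entries. The sketch below only illustrates that general idea; the `demo_*` structures and fields are hypothetical and are not the lnet selftest definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: fixed-width fields only, no pointers. */
struct demo_node_id {
	uint64_t nid;
	uint32_t pid;
	uint32_t pad;		/* explicit padding keeps the layout stable */
};

/* Hypothetical argument block: a count followed by that many entries. */
struct demo_group_args {
	uint32_t count;
	uint32_t reserved;
	struct demo_node_id ids[];	/* flexible array member */
};

/* Bytes the caller must allocate and copy for `count` entries. */
size_t demo_group_args_size(uint32_t count)
{
	return sizeof(struct demo_group_args) +
	       (size_t)count * sizeof(struct demo_node_id);
}
```

Because nothing in this layout is an address, the receiving side can validate `count` against the buffer length and walk the array without following pointers supplied by the other side.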
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8834
The lustre ioctl LL_IOC_FUTIMES_3 is very generic. Need to either work with
other file systems with similar functionality and make a common syscall
interface or rework our server code to automagically do it for us.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-6202
Clean up ioctl handling. We have many obsolete ioctls. Also the way we do
ioctls can be changed over to netlink. This also has the benefit of working
better with HPC systems that do IO forwarding. Such systems don't like ioctls
very well.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9667
More cleanups by making our utilities use sysfs instead of ioctls for LNet.
Also it has been requested to move the remaining ioctls to the netlink API.
******************************************************************************
* Misc
******************************************************************************
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9855
Clean up obdclass preprocessor code. One of the major eyesores is the various
pointer redirections and macros used by the obdclass. This makes the code very
difficult to understand. It was requested by Al Viro to clean this up before
we leave staging.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9633
Migrate to sphinx kernel-doc style comments. Add documents in Documentation.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-6142
Possible remaining coding style fix. Remove deadcode. Enforce kernel code
style. Other minor misc cleanups...
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8837
Separate client/server functionality. Functions only used by server can be
removed from client. Most of this has been done but we need an inspection of
the code to make sure.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-8964
Lustre client readahead/writeback control needs to better suit what the kernel
provides. This is currently being explored. We could end up replacing the CLIO
readahead abstraction with the kernel's own version.
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9862
Patch that landed for LU-7890 leads to static checker errors
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-9868
dcache/namei fixes for lustre
------------------------------------------------------------------------------
https://jira.hpdd.intel.com/browse/LU-10467
Use standard Linux wait_event macros (work by Neil Brown)
------------------------------------------------------------------------------
Please send any patches to Greg Kroah-Hartman <greg@kroah.com>, Andreas Dilger
<andreas.dilger@intel.com>, James Simmons <jsimmons@infradead.org> and
Oleg Drokin <oleg.drokin@intel.com>.